From yael at mellanox.co.il Wed Feb 1 01:11:38 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 01 Feb 2006 11:11:38 +0200
Subject: [openib-general] [PATCH] Opensm - fix error message
Message-ID: <5zvevz1m0l.fsf@mtl066.yok.mtl.com>

Hi Hal,

I saw that there is a problem with the printing of the error message
in __osm_pi_rcv_process_switch_port - the link state printed is the
state saved on the port, and not the state checked in the portInfo
received.
The attached patch fixes this.

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: opensm/osm_port_info_rcv.c
===================================================================
--- opensm/osm_port_info_rcv.c (revision 5204)
+++ opensm/osm_port_info_rcv.c (working copy)
@@ -334,7 +334,7 @@ __osm_pi_rcv_process_switch_port(
         osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                  "__osm_pi_rcv_process_switch_port: ERR 0F03: "
                  "Unknown link state = %u, port = 0x%X\n",
-                 osm_physp_get_port_state( p_physp ),
+                 ib_port_info_get_port_state( p_pi ),
                  p_pi->local_port_num );
         break;
       }

From rep.nop at aon.at Wed Feb 1 03:46:00 2006
From: rep.nop at aon.at (Bernhard Fischer)
Date: Wed, 1 Feb 2006 12:46:00 +0100
Subject: [openib-general] imgen/mic.cpp compilation error
In-Reply-To: <20060130110512.GW31887@mellanox.co.il>
References: <20060130105831.GA11627@aon.at> <20060130110512.GW31887@mellanox.co.il>
Message-ID: <20060201114600.GB27258@aon.at>

On Mon, Jan 30, 2006 at 01:05:12PM +0200, Michael S. Tsirkin wrote:
>
>https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet

I'm getting a compilation error in mic.cpp since the bool for hinting
whether to add an info section is missing:

g++ -Wall -W -Werror -g -O2 -MP -MD '-DBLD_VER_STR="devel"' '-DIBADM_VER_STR=""' -fno-exceptions -c -o mic.o mic.cpp
mic.cpp: In function 'int main(int, char**)':
mic.cpp:452: error: no matching function for call to 'TImage::TImage(ParamList*, char*&, char*&, const char*)'
TImage.h:84: note: candidates are: TImage::TImage(ParamList*, const char*, const char*, const char*, bool)
TImage.h:78: note: TImage::TImage(const TImage&)
make: *** [mic.o] Error 1

Looks like there should be a command-line switch to toggle this.
Perhaps mic.cpp wasn't updated when you
"Update from MFT 1.0.1 (IBG2.0.1)" TImage.h ?

The reason I am thinking about updating my FW (currently at 4.6.2) is
that I get errors from mvapich2 from openib svn (rev 5204) like this:

$ mpdrun -np 2 ./cpi
cannot create cq
Fail to init hca
rank 1 in job 11 x86-64n001_57976 caused collective abort of all ranks
exit status of rank 1: killed by signal 11

RDMA_DEFAULT_MAX_CQ_SIZE seems to default to 40000, so:

$ export RDMA_DEFAULT_MAX_CQ_SIZE=64
$ mpdrun -np 2 ./cpi
[Init] Fail to create qp for rank 0
Fail to init hca
rank 1 in job 12 x86-64n001_57976 caused collective abort of all ranks
exit status of rank 1: killed by signal 11

The ibv_*_pingpong tests do work properly even with the old firmware.
Does mvapich2 from svn work for anybody else?

From mst at mellanox.co.il Wed Feb 1 04:28:54 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Feb 2006 14:28:54 +0200
Subject: [openib-general] Re: imgen/mic.cpp compilation error
In-Reply-To: <20060201114600.GB27258@aon.at>
References: <20060201114600.GB27258@aon.at>
Message-ID: <20060201122854.GZ31887@mellanox.co.il>

Quoting r. Bernhard Fischer :
> Perhaps mic.cpp wasn't updated when you
> "Update from MFT 1.0.1 (IBG2.0.1)" TImage.h ?

Correct, good catch. Pls try again now.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From openib-general at clustervision.com Wed Feb 1 07:11:23 2006
From: openib-general at clustervision.com (Guido Passet)
Date: Wed, 01 Feb 2006 16:11:23 +0100
Subject: [openib-general] libibverbs.la seems to have a non-working default.
Message-ID: <43E0CF9B.3080509@clustervision.com>

Hi,

I ran into a problem compiling libibcm and managed to trace it back to
libibverbs. It seems that libibverbs.la has a static link for
libsysfs.la to /lib64 on a 64bit host and to /lib on a 32bits host.

However, on my SUSE10 system the libtool archive is located in
/usr/lib/libsysfs.la..

dependency_libs=' -L/usr/local/my-favorite-installdir/OpenIB/lib
/lib64/libsysfs.la -lpthread -ldl'

For now I have a workaround for this, but I am not sure if this is
either a feature or a bug ;)

Best regards,
--
Guido Passet            Email: guido.passet at clustervision.com
ClusterVision BV        Email support: support at clustervision.com
Nieuw-Zeelandweg 15B    Web: http://www.clustervision.com
1045 AL Amsterdam       Tel: +31 20 407 7550
The Netherlands         Fax: +31 84 759 8389

From yael at mellanox.co.il Wed Feb 1 05:21:00 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 01 Feb 2006 15:21:00 +0200
Subject: [openib-general] [PATCH] Opensm - asserts before OSM_LOG_ENTER
Message-ID: <5zu0bj1agz.fsf@mtl066.yok.mtl.com>

Hi Hal,

When trying to compile the windows stack with some late updates, I've
encountered an issue with the addition/change of place of asserts to
before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
then these asserts cause failure due to declaration in the middle of
the function.
These asserts are all on the receiver object or the manager object, so
I don't think they are really necessary.

The following patch removes these asserts.
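
[Illustration, not part of the original message.] The failure described
above can be reproduced in miniature. The sketch below is only an
illustration under stated assumptions: demo_log, DEMO_LOG_ENTER,
DEMO_LOG_EXIT and demo_process are hypothetical stand-ins, not OpenSM
code, and a compiler that provides __func__ is assumed (C99, or GCC as
an extension; Visual C offers __FUNCTION__ instead). It shows a logging
macro that expands to a plain expression, so an assert can come before
it without producing a declaration in the middle of the block.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for osm_log(): appends formatted text to a buffer. */
static char demo_buf[256];

static void demo_log(const char *fmt, ...)
{
    size_t used = strlen(demo_buf);
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(demo_buf + used, sizeof(demo_buf) - used, fmt, ap);
    va_end(ap);
}

/*
 * Old style, described for comparison only: a macro that first declares
 *     static const char demo_func_name[] = "name";
 * and then calls the logger. Any statement placed before such a macro
 * (an assert, for example) makes that declaration follow a statement,
 * which compilers limited to C89 rules reject.
 */

/* New style: no declaration and no trailing ';', just an expression. */
#define DEMO_LOG_ENTER() demo_log("%s: [\n", __func__)
#define DEMO_LOG_EXIT()  demo_log("%s: ]\n", __func__)

/* Hypothetical function: the assert may now precede the enter macro. */
static int demo_process(const int *p_value)
{
    int result;                /* all declarations stay at the top */

    assert(p_value);           /* statement before the logging macro */
    DEMO_LOG_ENTER();
    result = *p_value * 2;
    DEMO_LOG_EXIT();
    return result;
}
```

Calling demo_process() with a pointer to 21 returns 42 and leaves the
lines "demo_process: [" and "demo_process: ]" in demo_buf. Keeping the
semicolon out of the macro body leaves statement termination to the call
site.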

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: opensm/osm_pkey_rcv.c
===================================================================
--- opensm/osm_pkey_rcv.c (revision 5246)
+++ opensm/osm_pkey_rcv.c (working copy)
@@ -71,8 +71,6 @@ void
 osm_pkey_rcv_destroy(
   IN osm_pkey_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -125,8 +123,6 @@ osm_pkey_rcv_process(
   uint8_t port_num;
   uint16_t block_num;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sm_state_mgr.c
===================================================================
--- opensm/osm_sm_state_mgr.c (revision 5246)
+++ opensm/osm_sm_state_mgr.c (working copy)
@@ -406,8 +406,6 @@ void
 osm_sm_state_mgr_destroy(
   IN osm_sm_state_mgr_t * const p_sm_mgr )
 {
-  CL_ASSERT( p_sm_mgr );
-
   OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy );
 
   cl_spinlock_destroy( &p_sm_mgr->state_lock );
@@ -500,8 +498,6 @@ osm_sm_state_mgr_process(
 {
   ib_api_status_t status = IB_SUCCESS;
 
-  CL_ASSERT( p_sm_mgr );
-
   OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process );
 
   /*
@@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality(
 {
   ib_api_status_t status = IB_SUCCESS;
 
-  CL_ASSERT( p_sm_mgr );
-
   OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality );
 
   /*
Index: opensm/osm_state_mgr.c
===================================================================
--- opensm/osm_state_mgr.c (revision 5246)
+++ opensm/osm_state_mgr.c (working copy)
@@ -86,8 +86,6 @@ void
 osm_state_mgr_destroy(
   IN osm_state_mgr_t * const p_mgr )
 {
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy );
 
   /* destroy the locks */
@@ -1884,8 +1882,6 @@ osm_state_mgr_process(
   ib_api_status_t status;
   osm_remote_sm_t *p_remote_sm;
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process );
 
   /* if we are exiting do nothing */
Index: opensm/osm_sa_guidinfo_record.c
===================================================================
--- opensm/osm_sa_guidinfo_record.c (revision 5246)
+++ opensm/osm_sa_guidinfo_record.c (working copy)
@@ -433,8 +433,6 @@ osm_gir_rcv_process(
   ib_api_status_t status;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sa_vlarb_record.c
===================================================================
--- opensm/osm_sa_vlarb_record.c (revision 5246)
+++ opensm/osm_sa_vlarb_record.c (working copy)
@@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process(
   ib_net64_t comp_mask;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process );
 
   /* update the requestor physical port. */
Index: opensm/osm_sa_lft_record.c
===================================================================
--- opensm/osm_sa_lft_record.c (revision 5246)
+++ opensm/osm_sa_lft_record.c (working copy)
@@ -329,8 +329,6 @@ osm_lftr_rcv_process(
   ib_api_status_t status;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sa_portinfo_record.c
===================================================================
--- opensm/osm_sa_portinfo_record.c (revision 5246)
+++ opensm/osm_sa_portinfo_record.c (working copy)
@@ -600,8 +600,6 @@ osm_pir_rcv_process(
   osm_physp_t* p_req_physp;
   boolean_t trusted_req = TRUE;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_req.c
===================================================================
--- opensm/osm_req.c (revision 5246)
+++ opensm/osm_req.c (working copy)
@@ -131,8 +131,6 @@ osm_req_get(
   ib_api_status_t status = IB_SUCCESS;
   ib_net64_t tid;
 
-  CL_ASSERT( p_req );
-
   OSM_LOG_ENTER( p_req->p_log, osm_req_get );
 
   CL_ASSERT( p_path );
@@ -222,8 +220,6 @@ osm_req_set(
   ib_api_status_t status = IB_SUCCESS;
   ib_net64_t tid;
 
-  CL_ASSERT( p_req );
-
   OSM_LOG_ENTER( p_req->p_log, osm_req_set );
 
   CL_ASSERT( p_path );
Index: opensm/osm_sa_pkey_record.c
===================================================================
--- opensm/osm_sa_pkey_record.c (revision 5246)
+++ opensm/osm_sa_pkey_record.c (working copy)
@@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process(
   ib_net64_t comp_mask;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_lin_fwd_rcv.c
===================================================================
--- opensm/osm_lin_fwd_rcv.c (revision 5246)
+++ opensm/osm_lin_fwd_rcv.c (working copy)
@@ -75,8 +75,6 @@ void
 osm_lft_rcv_destroy(
   IN osm_lft_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -121,8 +119,6 @@ osm_lft_rcv_process(
   ib_net64_t node_guid;
   ib_api_status_t status;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sa_slvl_record.c
===================================================================
--- opensm/osm_sa_slvl_record.c (revision 5246)
+++ opensm/osm_sa_slvl_record.c (working copy)
@@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process(
   ib_net64_t comp_mask;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sminfo_rcv.c
===================================================================
--- opensm/osm_sminfo_rcv.c (revision 5246)
+++ opensm/osm_sminfo_rcv.c (working copy)
@@ -80,8 +80,6 @@ void
 osm_sminfo_rcv_destroy(
   IN osm_sminfo_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
Index: opensm/osm_node_info_rcv.c
===================================================================
--- opensm/osm_node_info_rcv.c (revision 5246)
+++ opensm/osm_node_info_rcv.c (working copy)
@@ -981,8 +981,6 @@ void
 osm_ni_rcv_destroy(
   IN osm_ni_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -1028,8 +1026,6 @@ osm_ni_rcv_process(
   osm_node_t *p_node;
   boolean_t process_new_flag = FALSE;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_mcast_mgr.c
===================================================================
--- opensm/osm_mcast_mgr.c (revision 5246)
+++ opensm/osm_mcast_mgr.c (working copy)
@@ -394,8 +394,6 @@ void
 osm_mcast_mgr_destroy(
   IN osm_mcast_mgr_t* const p_mgr )
 {
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy );
 
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl(
   ib_net16_t block[IB_MCAST_BLOCK_SIZE];
   osm_signal_t signal = OSM_SIGNAL_DONE;
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl );
 
   CL_ASSERT( p_sw );
Index: opensm/osm_sa_sminfo_record.c
===================================================================
--- opensm/osm_sa_sminfo_record.c (revision 5246)
+++ opensm/osm_sa_sminfo_record.c (working copy)
@@ -89,8 +89,6 @@ void
 osm_smir_rcv_destroy(
   IN osm_smir_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -142,8 +140,6 @@ osm_smir_rcv_process(
   ib_net64_t local_guid;
   osm_port_t* local_port;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_trap_rcv.c
===================================================================
--- opensm/osm_trap_rcv.c (revision 5246)
+++ opensm/osm_trap_rcv.c (working copy)
@@ -189,8 +189,6 @@ void
 osm_trap_rcv_destroy(
   IN osm_trap_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy );
 
   cl_event_wheel_destroy( &p_rcv->trap_aging_tracker );
Index: opensm/osm_ucast_mgr.c
===================================================================
--- opensm/osm_ucast_mgr.c (revision 5246)
+++ opensm/osm_ucast_mgr.c (working copy)
@@ -90,8 +90,6 @@ void
 osm_ucast_mgr_destroy(
   IN osm_ucast_mgr_t* const p_mgr )
 {
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy );
 
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table(
   uint32_t block_id_ho = 0;
   uint8_t block[IB_SMP_DATA_SIZE];
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table );
 
   CL_ASSERT( p_sw );
Index: opensm/osm_sa_node_record.c
===================================================================
--- opensm/osm_sa_node_record.c (revision 5246)
+++ opensm/osm_sa_node_record.c (working copy)
@@ -435,8 +435,6 @@ osm_nr_rcv_process(
   ib_api_status_t status;
   osm_physp_t* p_req_physp;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sw_info_rcv.c
===================================================================
--- opensm/osm_sw_info_rcv.c (revision 5246)
+++ opensm/osm_sw_info_rcv.c (working copy)
@@ -363,8 +363,6 @@ __osm_si_rcv_process_new(
   ib_smp_t *p_smp;
   cl_qmap_t *p_sw_guid_tbl;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new );
 
   CL_ASSERT( p_madw );
@@ -582,8 +580,6 @@ void
 osm_si_rcv_destroy(
   IN osm_si_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -631,8 +627,6 @@ osm_si_rcv_process(
   ib_net64_t node_guid;
   osm_si_context_t *p_context;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_mcast_fwd_rcv.c
===================================================================
--- opensm/osm_mcast_fwd_rcv.c (revision 5246)
+++ opensm/osm_mcast_fwd_rcv.c (working copy)
@@ -77,8 +77,6 @@ void
 osm_mft_rcv_destroy(
   IN osm_mft_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -124,8 +122,6 @@ osm_mft_rcv_process(
   ib_net64_t node_guid;
   ib_api_status_t status;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_slvl_map_rcv.c
===================================================================
--- opensm/osm_slvl_map_rcv.c (revision 5246)
+++ opensm/osm_slvl_map_rcv.c (working copy)
@@ -83,8 +83,6 @@ void
 osm_slvl_rcv_destroy(
   IN osm_slvl_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -136,8 +134,6 @@ osm_slvl_rcv_process(
   ib_net64_t node_guid;
   uint8_t out_port_num, in_port_num;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_node_desc_rcv.c
===================================================================
--- opensm/osm_node_desc_rcv.c (revision 5246)
+++ opensm/osm_node_desc_rcv.c (working copy)
@@ -109,8 +109,6 @@ void
 osm_nd_rcv_destroy(
   IN osm_nd_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -152,8 +150,6 @@ osm_nd_rcv_process(
   osm_node_t *p_node;
   ib_net64_t node_guid;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_sa_mcmember_record.c
===================================================================
--- opensm/osm_sa_mcmember_record.c (revision 5246)
+++ opensm/osm_sa_mcmember_record.c (working copy)
@@ -109,8 +109,6 @@ void
 osm_mcmr_rcv_destroy(
   IN osm_mcmr_recv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy );
 
   cl_qlock_pool_destroy( &p_rcv->pool );
@@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t*
   osm_physp_t* p_req_physp;
   boolean_t trusted_req = TRUE;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp );
 
   CL_ASSERT( p_madw );
@@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process(
   ib_member_rec_t *p_recvd_mcmember_rec;
   boolean_t valid;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process );
 
   CL_ASSERT( p_madw );
Index: opensm/osm_drop_mgr.c
===================================================================
--- opensm/osm_drop_mgr.c (revision 5246)
+++ opensm/osm_drop_mgr.c (working copy)
@@ -81,8 +81,6 @@ void
 osm_drop_mgr_destroy(
   IN osm_drop_mgr_t* const p_mgr )
 {
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy );
 
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -597,8 +595,6 @@ osm_drop_mgr_process(
   uint8_t port_num;
   osm_physp_t *p_physp;
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process );
 
   p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl;
Index: opensm/osm_lid_mgr.c
===================================================================
--- opensm/osm_lid_mgr.c (revision 5246)
+++ opensm/osm_lid_mgr.c (working copy)
@@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet(
   osm_physp_t *p_physp;
   int lid_changed;
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet );
 
   CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
Index: opensm/osm_pkey_mgr.c
===================================================================
--- opensm/osm_pkey_mgr.c (revision 5246)
+++ opensm/osm_pkey_mgr.c (working copy)
@@ -73,8 +73,6 @@ void
 osm_pkey_mgr_destroy(
   IN osm_pkey_mgr_t * const p_mgr )
 {
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy );
 
   OSM_LOG_EXIT( p_mgr->p_log );
@@ -238,8 +236,6 @@ osm_pkey_mgr_process(
   osm_physp_t *p_physp;
   osm_signal_t result = OSM_SIGNAL_DONE;
 
-  CL_ASSERT( p_mgr );
-
   OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process );
 
   p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl;
Index: opensm/osm_vl_arb_rcv.c
===================================================================
--- opensm/osm_vl_arb_rcv.c (revision 5246)
+++ opensm/osm_vl_arb_rcv.c (working copy)
@@ -83,8 +83,6 @@ void
 osm_vla_rcv_destroy(
   IN osm_vla_rcv_t* const p_rcv )
 {
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy );
 
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -136,8 +134,6 @@ osm_vla_rcv_process(
   ib_net64_t node_guid;
   uint8_t port_num, block_num;
 
-  CL_ASSERT( p_rcv );
-
   OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process );
 
   CL_ASSERT( p_madw );

From mst at mellanox.co.il Wed Feb 1 05:24:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Feb 2006 15:24:48 +0200
Subject: [openib-general] Re: libibverbs.la seems to have a non-working default.
In-Reply-To: <43E0CF9B.3080509@clustervision.com>
References: <43E0CF9B.3080509@clustervision.com>
Message-ID: <20060201132448.GD31887@mellanox.co.il>

Quoting r. Guido Passet :
> Subject: libibverbs.la seems to have a non-working default.
>
> Hi,
>
> I ran into a problem compiling libibcm and managed to trace it back to
> libibverbs. It seems that libibverbs.la has a static link for
> libsysfs.la to /lib64 on a 64bit host and to /lib on a 32bits host.
>
> However, on my SUSE10 system the libtool archive is located in
> /usr/lib/libsysfs.la..
>
> dependency_libs=' -L/usr/local/my-favorite-installdir/OpenIB/lib
> /lib64/libsysfs.la -lpthread -ldl'
>
> For now I have a workaround for this, but I am not sure if this is
> either a feature or a bug ;)
>
> Best regards,
> --
> Guido Passet Email: guido.passet at clustervision.com

Looks like an autotools problem at your end: libibverbs.la is a generated file.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Wed Feb 1 05:40:18 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Feb 2006 15:40:18 +0200
Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER
In-Reply-To: <5zu0bj1agz.fsf@mtl066.yok.mtl.com>
References: <5zu0bj1agz.fsf@mtl066.yok.mtl.com>
Message-ID: <20060201134018.GE31887@mellanox.co.il>

Quoting r. Yael Kalka :
> Subject: [PATCH] Opensm - asserts before OSM_LOG_ENTER
>
>
> Hi Hal,
>
> When trying to compile the windows stack with some late updates, I've
> encountered an issue with the addition/change of place of asserts to
> before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
> then these asserts cause failure due to declaration in the middle of
> the function.
> These asserts are all on the receiver object or the manager object, so
> I don't think they are really necessary.
>
> The following patch removes these asserts.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Macros that declare variables are evil.
Macros that end with ';' are evil too.

---

gcc has __func__, and Visual has __FUNCTION__, so ugly hackery
around OSM_LOG_ENTER is not needed anymore.

Signed-off-by: Michael S. Tsirkin

Index: openib/src/userspace/management/osm/include/opensm/osm_log.h
===================================================================
--- openib.orig/src/userspace/management/osm/include/opensm/osm_log.h 2005-10-16 10:49:04.791297000 +0200
+++ openib/src/userspace/management/osm/include/opensm/osm_log.h 2006-02-01 15:36:38.652180000 +0200
@@ -71,17 +71,13 @@ BEGIN_C_DECLS
 #define LOG_ENTRY_SIZE_MAX 4096
 #define BUF_SIZE LOG_ENTRY_SIZE_MAX
 
-#define OSM_LOG_DEFINE_FUNC( NAME ) \
-  static const char osm_log_func_name[] = #NAME
-
 #define OSM_LOG_ENTER( OSM_LOG_PTR, NAME ) \
-  OSM_LOG_DEFINE_FUNC( NAME ); \
   osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \
-    "%s: [\n", osm_log_func_name );
+    "%s: [\n", __func__ )
 
 #define OSM_LOG_EXIT( OSM_LOG_PTR ) \
   osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \
-    "%s: ]\n", osm_log_func_name );
+    "%s: ]\n", __func__ )
 
 /****h* OpenSM/Log
 * NAME

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From halr at voltaire.com Wed Feb 1 05:44:49 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Feb 2006 08:44:49 -0500
Subject: [openib-general] Re: [PATCH] Opensm - fix error message
In-Reply-To: <5zvevz1m0l.fsf@mtl066.yok.mtl.com>
References: <5zvevz1m0l.fsf@mtl066.yok.mtl.com>
Message-ID: <1138801488.4453.29994.camel@hal.voltaire.com>

Hi Yael,

On Wed, 2006-02-01 at 04:11, Yael Kalka wrote:
> Hi Hal,
>
> I saw that there is a problem with the printing of the error message
> in __osm_pi_rcv_process_switch_port - the link state printed is the
> state saved on the port, and not the state checked in the portInfo
> received.
> The attached patch fixes this.

Thanks. Applied.

-- Hal

> Thanks,
> Yael
>
>
> Signed-off-by: Yael Kalka
>
> Index: opensm/osm_port_info_rcv.c
> ===================================================================
> --- opensm/osm_port_info_rcv.c (revision 5204)
> +++ opensm/osm_port_info_rcv.c (working copy)
> @@ -334,7 +334,7 @@ __osm_pi_rcv_process_switch_port(
>          osm_log( p_rcv->p_log, OSM_LOG_ERROR,
>                   "__osm_pi_rcv_process_switch_port: ERR 0F03: "
>                   "Unknown link state = %u, port = 0x%X\n",
> -                 osm_physp_get_port_state( p_physp ),
> +                 ib_port_info_get_port_state( p_pi ),
>                   p_pi->local_port_num );
>          break;
>        }
>

From Arkady.Kanevsky at netapp.com Wed Feb 1 06:09:43 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 1 Feb 2006 09:09:43 -0500
Subject: [openib-general] IP Addressing Annex v4.
Message-ID:

Major changes are:
definition and use of "downward compatible",
ARI suggested value.

Arkady Kanevsky                 email: arkady at netapp.com
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_address_annex_v4.pdf
Type: application/octet-stream
Size: 69494 bytes
Desc: ip_address_annex_v4.pdf
URL:

From halr at voltaire.com Wed Feb 1 06:10:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Feb 2006 09:10:48 -0500
Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER
In-Reply-To: <5zu0bj1agz.fsf@mtl066.yok.mtl.com>
References: <5zu0bj1agz.fsf@mtl066.yok.mtl.com>
Message-ID: <1138802923.4453.30336.camel@hal.voltaire.com>

Hi Yael,

On Wed, 2006-02-01 at 08:21, Yael Kalka wrote:
> Hi Hal,
>
> When trying to compile the windows stack with some late updates, I've
> encountered an issue with the addition/change of place of asserts to
> before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
> then these asserts cause failure due to declaration in the middle of
> the function.

The asserts are on a passed in pointer rather than the static variable
created by the MACRO based on the second parameter to OSM_LOG_ENTER. I
don't understand how this causes a problem. Is it Windows only ?

> These asserts are all on the receiver object or the manager object, so
> I don't think they are really necessary.

They compile out when not using debug. I saw these trip at SC05.

-- Hal

> The following patch removes these asserts.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
>
> Index: opensm/osm_pkey_rcv.c
> ===================================================================
> --- opensm/osm_pkey_rcv.c (revision 5246)
> +++ opensm/osm_pkey_rcv.c (working copy)
> @@ -71,8 +71,6 @@ void
>  osm_pkey_rcv_destroy(
>    IN osm_pkey_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> @@ -125,8 +123,6 @@ osm_pkey_rcv_process(
>    uint8_t port_num;
>    uint16_t block_num;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sm_state_mgr.c
> ===================================================================
> --- opensm/osm_sm_state_mgr.c (revision 5246)
> +++ opensm/osm_sm_state_mgr.c (working copy)
> @@ -406,8 +406,6 @@ void
>  osm_sm_state_mgr_destroy(
>    IN osm_sm_state_mgr_t * const p_sm_mgr )
>  {
> -  CL_ASSERT( p_sm_mgr );
> -
>    OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy );
>
>    cl_spinlock_destroy( &p_sm_mgr->state_lock );
> @@ -500,8 +498,6 @@ osm_sm_state_mgr_process(
>  {
>    ib_api_status_t status = IB_SUCCESS;
>
> -  CL_ASSERT( p_sm_mgr );
> -
>    OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process );
>
>    /*
> @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality(
>  {
>    ib_api_status_t status = IB_SUCCESS;
>
> -  CL_ASSERT( p_sm_mgr );
> -
>    OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality );
>
>    /*
> Index: opensm/osm_state_mgr.c
> ===================================================================
> --- opensm/osm_state_mgr.c (revision 5246)
> +++ opensm/osm_state_mgr.c (working copy)
> @@ -86,8 +86,6 @@ void
>  osm_state_mgr_destroy(
>    IN osm_state_mgr_t * const p_mgr )
>  {
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy );
>
>    /* destroy the locks */
> @@ -1884,8 +1882,6 @@ osm_state_mgr_process(
>    ib_api_status_t status;
>    osm_remote_sm_t *p_remote_sm;
>
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process );
>
>    /* if we are exiting do nothing */
> Index: opensm/osm_sa_guidinfo_record.c
> ===================================================================
> --- opensm/osm_sa_guidinfo_record.c (revision 5246)
> +++ opensm/osm_sa_guidinfo_record.c (working copy)
> @@ -433,8 +433,6 @@ osm_gir_rcv_process(
>    ib_api_status_t status;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sa_vlarb_record.c
> ===================================================================
> --- opensm/osm_sa_vlarb_record.c (revision 5246)
> +++ opensm/osm_sa_vlarb_record.c (working copy)
> @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process(
>    ib_net64_t comp_mask;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process );
>
>    /* update the requestor physical port. */
> Index: opensm/osm_sa_lft_record.c
> ===================================================================
> --- opensm/osm_sa_lft_record.c (revision 5246)
> +++ opensm/osm_sa_lft_record.c (working copy)
> @@ -329,8 +329,6 @@ osm_lftr_rcv_process(
>    ib_api_status_t status;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sa_portinfo_record.c
> ===================================================================
> --- opensm/osm_sa_portinfo_record.c (revision 5246)
> +++ opensm/osm_sa_portinfo_record.c (working copy)
> @@ -600,8 +600,6 @@ osm_pir_rcv_process(
>    osm_physp_t* p_req_physp;
>    boolean_t trusted_req = TRUE;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_req.c
> ===================================================================
> --- opensm/osm_req.c (revision 5246)
> +++ opensm/osm_req.c (working copy)
> @@ -131,8 +131,6 @@ osm_req_get(
>    ib_api_status_t status = IB_SUCCESS;
>    ib_net64_t tid;
>
> -  CL_ASSERT( p_req );
> -
>    OSM_LOG_ENTER( p_req->p_log, osm_req_get );
>
>    CL_ASSERT( p_path );
> @@ -222,8 +220,6 @@ osm_req_set(
>    ib_api_status_t status = IB_SUCCESS;
>    ib_net64_t tid;
>
> -  CL_ASSERT( p_req );
> -
>    OSM_LOG_ENTER( p_req->p_log, osm_req_set );
>
>    CL_ASSERT( p_path );
> Index: opensm/osm_sa_pkey_record.c
> ===================================================================
> --- opensm/osm_sa_pkey_record.c (revision 5246)
> +++ opensm/osm_sa_pkey_record.c (working copy)
> @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process(
>    ib_net64_t comp_mask;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_lin_fwd_rcv.c
> ===================================================================
> --- opensm/osm_lin_fwd_rcv.c (revision 5246)
> +++ opensm/osm_lin_fwd_rcv.c (working copy)
> @@ -75,8 +75,6 @@ void
>  osm_lft_rcv_destroy(
>    IN osm_lft_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> @@ -121,8 +119,6 @@ osm_lft_rcv_process(
>    ib_net64_t node_guid;
>    ib_api_status_t status;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sa_slvl_record.c
> ===================================================================
> --- opensm/osm_sa_slvl_record.c (revision 5246)
> +++ opensm/osm_sa_slvl_record.c (working copy)
> @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process(
>    ib_net64_t comp_mask;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sminfo_rcv.c
> ===================================================================
> --- opensm/osm_sminfo_rcv.c (revision 5246)
> +++ opensm/osm_sminfo_rcv.c (working copy)
> @@ -80,8 +80,6 @@ void
>  osm_sminfo_rcv_destroy(
>    IN osm_sminfo_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> Index: opensm/osm_node_info_rcv.c
> ===================================================================
> --- opensm/osm_node_info_rcv.c (revision 5246)
> +++ opensm/osm_node_info_rcv.c (working copy)
> @@ -981,8 +981,6 @@ void
>  osm_ni_rcv_destroy(
>    IN osm_ni_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> @@ -1028,8 +1026,6 @@ osm_ni_rcv_process(
>    osm_node_t *p_node;
>    boolean_t process_new_flag = FALSE;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_mcast_mgr.c
> ===================================================================
> --- opensm/osm_mcast_mgr.c (revision 5246)
> +++ opensm/osm_mcast_mgr.c (working copy)
> @@ -394,8 +394,6 @@ void
>  osm_mcast_mgr_destroy(
>    IN osm_mcast_mgr_t* const p_mgr )
>  {
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy );
>
>    OSM_LOG_EXIT( p_mgr->p_log );
> @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl(
>    ib_net16_t block[IB_MCAST_BLOCK_SIZE];
>    osm_signal_t signal = OSM_SIGNAL_DONE;
>
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl );
>
>    CL_ASSERT( p_sw );
> Index: opensm/osm_sa_sminfo_record.c
> ===================================================================
> --- opensm/osm_sa_sminfo_record.c (revision 5246)
> +++ opensm/osm_sa_sminfo_record.c (working copy)
> @@ -89,8 +89,6 @@ void
>  osm_smir_rcv_destroy(
>    IN osm_smir_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> @@ -142,8 +140,6 @@ osm_smir_rcv_process(
>    ib_net64_t local_guid;
>    osm_port_t* local_port;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_trap_rcv.c
> ===================================================================
> --- opensm/osm_trap_rcv.c (revision 5246)
> +++ opensm/osm_trap_rcv.c (working copy)
> @@ -189,8 +189,6 @@ void
>  osm_trap_rcv_destroy(
>    IN osm_trap_rcv_t* const p_rcv )
>  {
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy );
>
>    cl_event_wheel_destroy( &p_rcv->trap_aging_tracker );
> Index: opensm/osm_ucast_mgr.c
> ===================================================================
> --- opensm/osm_ucast_mgr.c (revision 5246)
> +++ opensm/osm_ucast_mgr.c (working copy)
> @@ -90,8 +90,6 @@ void
>  osm_ucast_mgr_destroy(
>    IN osm_ucast_mgr_t* const p_mgr )
>  {
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy );
>
>    OSM_LOG_EXIT( p_mgr->p_log );
> @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table(
>    uint32_t block_id_ho = 0;
>    uint8_t block[IB_SMP_DATA_SIZE];
>
> -  CL_ASSERT( p_mgr );
> -
>    OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table );
>
>    CL_ASSERT( p_sw );
> Index: opensm/osm_sa_node_record.c
> ===================================================================
> --- opensm/osm_sa_node_record.c (revision 5246)
> +++ opensm/osm_sa_node_record.c (working copy)
> @@ -435,8 +435,6 @@ osm_nr_rcv_process(
>    ib_api_status_t status;
>    osm_physp_t* p_req_physp;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process );
>
>    CL_ASSERT( p_madw );
> Index: opensm/osm_sw_info_rcv.c
> ===================================================================
> --- opensm/osm_sw_info_rcv.c (revision 5246)
> +++ opensm/osm_sw_info_rcv.c (working copy)
> @@ -363,8 +363,6 @@ __osm_si_rcv_process_new(
>    ib_smp_t *p_smp;
>    cl_qmap_t *p_sw_guid_tbl;
>
> -  CL_ASSERT( p_rcv );
> -
>    OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new );
>
>    CL_ASSERT( p_madw );
> @@
-582,8 +580,6 @@ void > osm_si_rcv_destroy( > IN osm_si_rcv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -631,8 +627,6 @@ osm_si_rcv_process( > ib_net64_t node_guid; > osm_si_context_t *p_context; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > CL_ASSERT( p_madw ); > Index: opensm/osm_mcast_fwd_rcv.c > =================================================================== > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > @@ -77,8 +77,6 @@ void > osm_mft_rcv_destroy( > IN osm_mft_rcv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > ib_net64_t node_guid; > ib_api_status_t status; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > CL_ASSERT( p_madw ); > Index: opensm/osm_slvl_map_rcv.c > =================================================================== > --- opensm/osm_slvl_map_rcv.c (revision 5246) > +++ opensm/osm_slvl_map_rcv.c (working copy) > @@ -83,8 +83,6 @@ void > osm_slvl_rcv_destroy( > IN osm_slvl_rcv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > ib_net64_t node_guid; > uint8_t out_port_num, in_port_num; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > CL_ASSERT( p_madw ); > Index: opensm/osm_node_desc_rcv.c > =================================================================== > --- opensm/osm_node_desc_rcv.c (revision 5246) > +++ opensm/osm_node_desc_rcv.c (working copy) > @@ -109,8 +109,6 @@ void > osm_nd_rcv_destroy( > IN osm_nd_rcv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > 
> OSM_LOG_EXIT( p_rcv->p_log ); > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > osm_node_t *p_node; > ib_net64_t node_guid; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > CL_ASSERT( p_madw ); > Index: opensm/osm_sa_mcmember_record.c > =================================================================== > --- opensm/osm_sa_mcmember_record.c (revision 5246) > +++ opensm/osm_sa_mcmember_record.c (working copy) > @@ -109,8 +109,6 @@ void > osm_mcmr_rcv_destroy( > IN osm_mcmr_recv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > cl_qlock_pool_destroy( &p_rcv->pool ); > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > osm_physp_t* p_req_physp; > boolean_t trusted_req = TRUE; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > CL_ASSERT( p_madw ); > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > ib_member_rec_t *p_recvd_mcmember_rec; > boolean_t valid; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > CL_ASSERT( p_madw ); > Index: opensm/osm_drop_mgr.c > =================================================================== > --- opensm/osm_drop_mgr.c (revision 5246) > +++ opensm/osm_drop_mgr.c (working copy) > @@ -81,8 +81,6 @@ void > osm_drop_mgr_destroy( > IN osm_drop_mgr_t* const p_mgr ) > { > - CL_ASSERT( p_mgr ); > - > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > OSM_LOG_EXIT( p_mgr->p_log ); > @@ -597,8 +595,6 @@ osm_drop_mgr_process( > uint8_t port_num; > osm_physp_t *p_physp; > > - CL_ASSERT( p_mgr ); > - > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 5246) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > osm_physp_t *p_physp; > int 
lid_changed; > > - CL_ASSERT( p_mgr ); > - > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > Index: opensm/osm_pkey_mgr.c > =================================================================== > --- opensm/osm_pkey_mgr.c (revision 5246) > +++ opensm/osm_pkey_mgr.c (working copy) > @@ -73,8 +73,6 @@ void > osm_pkey_mgr_destroy( > IN osm_pkey_mgr_t * const p_mgr ) > { > - CL_ASSERT( p_mgr ); > - > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > OSM_LOG_EXIT( p_mgr->p_log ); > @@ -238,8 +236,6 @@ osm_pkey_mgr_process( > osm_physp_t *p_physp; > osm_signal_t result = OSM_SIGNAL_DONE; > > - CL_ASSERT( p_mgr ); > - > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > Index: opensm/osm_vl_arb_rcv.c > =================================================================== > --- opensm/osm_vl_arb_rcv.c (revision 5246) > +++ opensm/osm_vl_arb_rcv.c (working copy) > @@ -83,8 +83,6 @@ void > osm_vla_rcv_destroy( > IN osm_vla_rcv_t* const p_rcv ) > { > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); > > OSM_LOG_EXIT( p_rcv->p_log ); > @@ -136,8 +134,6 @@ osm_vla_rcv_process( > ib_net64_t node_guid; > uint8_t port_num, block_num; > > - CL_ASSERT( p_rcv ); > - > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process ); > > CL_ASSERT( p_madw ); > From halr at voltaire.com Wed Feb 1 06:15:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Feb 2006 09:15:09 -0500 Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER In-Reply-To: <20060201134018.GE31887@mellanox.co.il> References: <5zu0bj1agz.fsf@mtl066.yok.mtl.com> <20060201134018.GE31887@mellanox.co.il> Message-ID: <1138803307.15119.63.camel@hal.voltaire.com> Hi Michael, , On Wed, 2006-02-01 at 08:40, Michael S. Tsirkin wrote: > Quoting r. 
Yael Kalka :
> > Subject: [PATCH] Opensm - asserts before OSM_LOG_ENTER
> >
> >
> > Hi Hal,
> >
> > When trying to compile the windows stack with some late updates, I've
> > encountered an issue with the addition/change of place of asserts to
> > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
> > then these asserts cause failure due to declaration in the middle of
> > the function.
> > These asserts are all on the receiver object or the manager object, so
> > I don't think they are really necessary.
> >
> > The following patch removes these asserts.
> >
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
>
> Macros that declare variables are evil. Macros that end with ';'
> are evil too.
>
> ---
>
> gcc has __func__, and Visual has __FUNCTION__, so ugly
> hackery around OSM_LOG_ENTER is not needed anymore.
>
> Signed-off-by: Michael S. Tsirkin
>
> Index: openib/src/userspace/management/osm/include/opensm/osm_log.h
> ===================================================================
> --- openib.orig/src/userspace/management/osm/include/opensm/osm_log.h 2005-10-16 10:49:04.791297000 +0200
> +++ openib/src/userspace/management/osm/include/opensm/osm_log.h 2006-02-01 15:36:38.652180000 +0200
> @@ -71,17 +71,13 @@ BEGIN_C_DECLS
>  #define LOG_ENTRY_SIZE_MAX 4096
>  #define BUF_SIZE LOG_ENTRY_SIZE_MAX
>
> -#define OSM_LOG_DEFINE_FUNC( NAME ) \
> -    static const char osm_log_func_name[] = #NAME
> -
>  #define OSM_LOG_ENTER( OSM_LOG_PTR, NAME ) \
> -    OSM_LOG_DEFINE_FUNC( NAME ); \
>      osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \
> -        "%s: [\n", osm_log_func_name );
> +        "%s: [\n", __func__ )
>
>  #define OSM_LOG_EXIT( OSM_LOG_PTR ) \
>      osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \
> -        "%s: ]\n", osm_log_func_name );
> +        "%s: ]\n", __func__ )
>
>  /****h* OpenSM/Log
>  * NAME

This looks fine for Linux to me, but since this file is shared by
Windows, doesn't it need conditionalization for those builds?

Also, will __FUNCTION__ work in all the Windows cases?

-- Hal

From eitan at mellanox.co.il Wed Feb 1 06:50:12 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 1 Feb 2006 16:50:12 +0200
Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B68D@mtlexch01.mtl.com>

Hi Hal,

Please see below:

> > When trying to compile the windows stack with some late updates, I've
> > encountered an issue with the addition/change of place of asserts to
> > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
> > then these asserts cause failure due to declaration in the middle of
> > the function.
>
> The asserts are on a passed in pointer rather than the static variable
> created by the MACRO based on the second parameter to OSM_LOG_ENTER. I
> don't understand how this causes a problem. Is it Windows only?

[EZ] In ANSI C (C89) you are not allowed to mix variable declarations with
statements like "if" (in the same code block). In a debug build, CL_ASSERT
includes an "if" statement, which is then followed by OSM_LOG_ENTER, which
declares the static variable. This used to fail the build on Linux, but for
some reason it stopped. I guess if we used -pedantic or ANSI it would fail.

Anyway, the assert is on the passed-down parameter and trips only when that
parameter is passed as NULL. This might happen only on a race in the
"destroy" flow - but if this is a race, the assert is not guaranteed to
catch the bug, as the pointer might be freed after the assert. It should be
caught as a "segfault" (dereferencing a NULL pointer) or in valgrind.

We have a few options:
a. Do not use the same code tree for WinIB - I do not think we want that.
b. Put everything after the CL_ASSERT in an internal code block (i.e. "{") -
   I do not think we want to do this either.
c. Move the CL_ASSERT before the function call (into the function caller).
d. Give up these few asserts, as this can only happen as a race during
   resource destruction.

I think that in this case it is more important to keep the WinIB and Linux
trees identical.

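As a hedged illustration of the declaration-ordering problem discussed above (this is not OpenSM code): under C89 rules a declaration may not follow a statement within the same block, so a logging macro that declares a variable cannot legally come after an assert; C99 and later lift that restriction. The sketch below uses hypothetical stand-ins (`demo_log`, `DEMO_LOG_ENTER`, `DEMO_LOG_EXIT`, `DEMO_FUNC`, `demo_destroy`) to show an entry/exit macro that expands to a plain expression and so can follow an assert. It assumes the compiler provides `__func__` (C99) or, for Microsoft compilers, `__FUNCTION__`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: Microsoft compilers spell the current-function name
 * __FUNCTION__, while C99 compilers provide __func__. */
#if defined(_MSC_VER)
#define DEMO_FUNC __FUNCTION__
#else
#define DEMO_FUNC __func__
#endif

/* Hypothetical stand-in for osm_log(); keeps the last entry for inspection. */
static char demo_last_entry[128];

static void demo_log(const char *func, const char *marker)
{
    /* Precision limits keep the result inside the 128-byte buffer. */
    sprintf(demo_last_entry, "%.100s: %.20s", func, marker);
    puts(demo_last_entry);
}

/* Expression-only macros: they declare nothing and carry no trailing ';'. */
#define DEMO_LOG_ENTER() demo_log(DEMO_FUNC, "[")
#define DEMO_LOG_EXIT()  demo_log(DEMO_FUNC, "]")

static int demo_destroy(const int *p_obj)
{
    assert(p_obj != NULL);  /* a statement first, as CL_ASSERT would be */
    DEMO_LOG_ENTER();       /* legal after a statement: no declaration here */
    DEMO_LOG_EXIT();
    return *p_obj;
}
```

Calling `demo_destroy` with a valid pointer logs one entry line and one exit line, each prefixed with the function name. Whether a similar mapping belongs in osm_log.h or in complib is a design question for the maintainers; the sketch only shows that the macro no longer constrains where asserts may appear.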
>
> > These asserts are all on the receiver object or the manager object, so
> > I don't think they are really necessary.
>
> They compile out when not using debug. I saw these trip at SC05.

[EZ] As explained - yes, they can trip - but only if we have memory
pollution (which could be caught by valgrind) or during exit, when there
really is a race that might not be caught by the assert.

>
> -- Hal
>
> > The following patch removes these asserts.
> >
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
> >
> > Index: opensm/osm_pkey_rcv.c
> > ===================================================================
> > --- opensm/osm_pkey_rcv.c (revision 5246)
> > +++ opensm/osm_pkey_rcv.c (working copy)
> > @@ -71,8 +71,6 @@ void
> >  osm_pkey_rcv_destroy(
> >    IN osm_pkey_rcv_t* const p_rcv )
> >  {
> > -  CL_ASSERT( p_rcv );
> > -
> >    OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy );
> >
> >    OSM_LOG_EXIT( p_rcv->p_log );
> > @@ -125,8 +123,6 @@ osm_pkey_rcv_process(
> >    uint8_t port_num;
> >    uint16_t block_num;
> >
> > -  CL_ASSERT( p_rcv );
> > -
> >    OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process );
> >
> >    CL_ASSERT( p_madw );
> > Index: opensm/osm_sm_state_mgr.c
> > ===================================================================
> > --- opensm/osm_sm_state_mgr.c (revision 5246)
> > +++ opensm/osm_sm_state_mgr.c (working copy)
> > @@ -406,8 +406,6 @@ void
> >  osm_sm_state_mgr_destroy(
> >    IN osm_sm_state_mgr_t * const p_sm_mgr )
> >  {
> > -  CL_ASSERT( p_sm_mgr );
> > -
> >    OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy );
> >
> >    cl_spinlock_destroy( &p_sm_mgr->state_lock );
> > @@ -500,8 +498,6 @@ osm_sm_state_mgr_process(
> >  {
> >    ib_api_status_t status = IB_SUCCESS;
> >
> > -  CL_ASSERT( p_sm_mgr );
> > -
> >    OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process );
> >
> >    /*
> > @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality(
> >  {
> >    ib_api_status_t status = IB_SUCCESS;
> >
> > -  CL_ASSERT( p_sm_mgr );
> > -
> >    OSM_LOG_ENTER(
p_sm_mgr->p_log, osm_sm_state_mgr_check_legality ); > > > > /* > > Index: opensm/osm_state_mgr.c > > =================================================================== > > --- opensm/osm_state_mgr.c (revision 5246) > > +++ opensm/osm_state_mgr.c (working copy) > > @@ -86,8 +86,6 @@ void > > osm_state_mgr_destroy( > > IN osm_state_mgr_t * const p_mgr ) > > { > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); > > > > /* destroy the locks */ > > @@ -1884,8 +1882,6 @@ osm_state_mgr_process( > > ib_api_status_t status; > > osm_remote_sm_t *p_remote_sm; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); > > > > /* if we are exiting do nothing */ > > Index: opensm/osm_sa_guidinfo_record.c > > =================================================================== > > --- opensm/osm_sa_guidinfo_record.c (revision 5246) > > +++ opensm/osm_sa_guidinfo_record.c (working copy) > > @@ -433,8 +433,6 @@ osm_gir_rcv_process( > > ib_api_status_t status; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sa_vlarb_record.c > > =================================================================== > > --- opensm/osm_sa_vlarb_record.c (revision 5246) > > +++ opensm/osm_sa_vlarb_record.c (working copy) > > @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process( > > ib_net64_t comp_mask; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); > > > > /* update the requestor physical port. 
*/ > > Index: opensm/osm_sa_lft_record.c > > =================================================================== > > --- opensm/osm_sa_lft_record.c (revision 5246) > > +++ opensm/osm_sa_lft_record.c (working copy) > > @@ -329,8 +329,6 @@ osm_lftr_rcv_process( > > ib_api_status_t status; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sa_portinfo_record.c > > =================================================================== > > --- opensm/osm_sa_portinfo_record.c (revision 5246) > > +++ opensm/osm_sa_portinfo_record.c (working copy) > > @@ -600,8 +600,6 @@ osm_pir_rcv_process( > > osm_physp_t* p_req_physp; > > boolean_t trusted_req = TRUE; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_req.c > > =================================================================== > > --- opensm/osm_req.c (revision 5246) > > +++ opensm/osm_req.c (working copy) > > @@ -131,8 +131,6 @@ osm_req_get( > > ib_api_status_t status = IB_SUCCESS; > > ib_net64_t tid; > > > > - CL_ASSERT( p_req ); > > - > > OSM_LOG_ENTER( p_req->p_log, osm_req_get ); > > > > CL_ASSERT( p_path ); > > @@ -222,8 +220,6 @@ osm_req_set( > > ib_api_status_t status = IB_SUCCESS; > > ib_net64_t tid; > > > > - CL_ASSERT( p_req ); > > - > > OSM_LOG_ENTER( p_req->p_log, osm_req_set ); > > > > CL_ASSERT( p_path ); > > Index: opensm/osm_sa_pkey_record.c > > =================================================================== > > --- opensm/osm_sa_pkey_record.c (revision 5246) > > +++ opensm/osm_sa_pkey_record.c (working copy) > > @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process( > > ib_net64_t comp_mask; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_lin_fwd_rcv.c > > 
=================================================================== > > --- opensm/osm_lin_fwd_rcv.c (revision 5246) > > +++ opensm/osm_lin_fwd_rcv.c (working copy) > > @@ -75,8 +75,6 @@ void > > osm_lft_rcv_destroy( > > IN osm_lft_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -121,8 +119,6 @@ osm_lft_rcv_process( > > ib_net64_t node_guid; > > ib_api_status_t status; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sa_slvl_record.c > > =================================================================== > > --- opensm/osm_sa_slvl_record.c (revision 5246) > > +++ opensm/osm_sa_slvl_record.c (working copy) > > @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process( > > ib_net64_t comp_mask; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sminfo_rcv.c > > =================================================================== > > --- opensm/osm_sminfo_rcv.c (revision 5246) > > +++ opensm/osm_sminfo_rcv.c (working copy) > > @@ -80,8 +80,6 @@ void > > osm_sminfo_rcv_destroy( > > IN osm_sminfo_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > Index: opensm/osm_node_info_rcv.c > > =================================================================== > > --- opensm/osm_node_info_rcv.c (revision 5246) > > +++ opensm/osm_node_info_rcv.c (working copy) > > @@ -981,8 +981,6 @@ void > > osm_ni_rcv_destroy( > > IN osm_ni_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -1028,8 +1026,6 @@ osm_ni_rcv_process( > > osm_node_t *p_node; > > boolean_t 
process_new_flag = FALSE; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_mcast_mgr.c > > =================================================================== > > --- opensm/osm_mcast_mgr.c (revision 5246) > > +++ opensm/osm_mcast_mgr.c (working copy) > > @@ -394,8 +394,6 @@ void > > osm_mcast_mgr_destroy( > > IN osm_mcast_mgr_t* const p_mgr ) > > { > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl( > > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > > osm_signal_t signal = OSM_SIGNAL_DONE; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); > > > > CL_ASSERT( p_sw ); > > Index: opensm/osm_sa_sminfo_record.c > > =================================================================== > > --- opensm/osm_sa_sminfo_record.c (revision 5246) > > +++ opensm/osm_sa_sminfo_record.c (working copy) > > @@ -89,8 +89,6 @@ void > > osm_smir_rcv_destroy( > > IN osm_smir_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -142,8 +140,6 @@ osm_smir_rcv_process( > > ib_net64_t local_guid; > > osm_port_t* local_port; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_trap_rcv.c > > =================================================================== > > --- opensm/osm_trap_rcv.c (revision 5246) > > +++ opensm/osm_trap_rcv.c (working copy) > > @@ -189,8 +189,6 @@ void > > osm_trap_rcv_destroy( > > IN osm_trap_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); > > > > cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); > > Index: opensm/osm_ucast_mgr.c > > 
=================================================================== > > --- opensm/osm_ucast_mgr.c (revision 5246) > > +++ opensm/osm_ucast_mgr.c (working copy) > > @@ -90,8 +90,6 @@ void > > osm_ucast_mgr_destroy( > > IN osm_ucast_mgr_t* const p_mgr ) > > { > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table( > > uint32_t block_id_ho = 0; > > uint8_t block[IB_SMP_DATA_SIZE]; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); > > > > CL_ASSERT( p_sw ); > > Index: opensm/osm_sa_node_record.c > > =================================================================== > > --- opensm/osm_sa_node_record.c (revision 5246) > > +++ opensm/osm_sa_node_record.c (working copy) > > @@ -435,8 +435,6 @@ osm_nr_rcv_process( > > ib_api_status_t status; > > osm_physp_t* p_req_physp; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sw_info_rcv.c > > =================================================================== > > --- opensm/osm_sw_info_rcv.c (revision 5246) > > +++ opensm/osm_sw_info_rcv.c (working copy) > > @@ -363,8 +363,6 @@ __osm_si_rcv_process_new( > > ib_smp_t *p_smp; > > cl_qmap_t *p_sw_guid_tbl; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); > > > > CL_ASSERT( p_madw ); > > @@ -582,8 +580,6 @@ void > > osm_si_rcv_destroy( > > IN osm_si_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -631,8 +627,6 @@ osm_si_rcv_process( > > ib_net64_t node_guid; > > osm_si_context_t *p_context; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_mcast_fwd_rcv.c > > 
=================================================================== > > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > > @@ -77,8 +77,6 @@ void > > osm_mft_rcv_destroy( > > IN osm_mft_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > > ib_net64_t node_guid; > > ib_api_status_t status; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_slvl_map_rcv.c > > =================================================================== > > --- opensm/osm_slvl_map_rcv.c (revision 5246) > > +++ opensm/osm_slvl_map_rcv.c (working copy) > > @@ -83,8 +83,6 @@ void > > osm_slvl_rcv_destroy( > > IN osm_slvl_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > > ib_net64_t node_guid; > > uint8_t out_port_num, in_port_num; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_node_desc_rcv.c > > =================================================================== > > --- opensm/osm_node_desc_rcv.c (revision 5246) > > +++ opensm/osm_node_desc_rcv.c (working copy) > > @@ -109,8 +109,6 @@ void > > osm_nd_rcv_destroy( > > IN osm_nd_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > > osm_node_t *p_node; > > ib_net64_t node_guid; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_sa_mcmember_record.c > > 
=================================================================== > > --- opensm/osm_sa_mcmember_record.c (revision 5246) > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > @@ -109,8 +109,6 @@ void > > osm_mcmr_rcv_destroy( > > IN osm_mcmr_recv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > > > cl_qlock_pool_destroy( &p_rcv->pool ); > > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > > osm_physp_t* p_req_physp; > > boolean_t trusted_req = TRUE; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > > > CL_ASSERT( p_madw ); > > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > > ib_member_rec_t *p_recvd_mcmember_rec; > > boolean_t valid; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > > > CL_ASSERT( p_madw ); > > Index: opensm/osm_drop_mgr.c > > =================================================================== > > --- opensm/osm_drop_mgr.c (revision 5246) > > +++ opensm/osm_drop_mgr.c (working copy) > > @@ -81,8 +81,6 @@ void > > osm_drop_mgr_destroy( > > IN osm_drop_mgr_t* const p_mgr ) > > { > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > @@ -597,8 +595,6 @@ osm_drop_mgr_process( > > uint8_t port_num; > > osm_physp_t *p_physp; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > Index: opensm/osm_lid_mgr.c > > =================================================================== > > --- opensm/osm_lid_mgr.c (revision 5246) > > +++ opensm/osm_lid_mgr.c (working copy) > > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > > osm_physp_t *p_physp; > > int lid_changed; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock 
); > > Index: opensm/osm_pkey_mgr.c > > =================================================================== > > --- opensm/osm_pkey_mgr.c (revision 5246) > > +++ opensm/osm_pkey_mgr.c (working copy) > > @@ -73,8 +73,6 @@ void > > osm_pkey_mgr_destroy( > > IN osm_pkey_mgr_t * const p_mgr ) > > { > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > @@ -238,8 +236,6 @@ osm_pkey_mgr_process( > > osm_physp_t *p_physp; > > osm_signal_t result = OSM_SIGNAL_DONE; > > > > - CL_ASSERT( p_mgr ); > > - > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > Index: opensm/osm_vl_arb_rcv.c > > =================================================================== > > --- opensm/osm_vl_arb_rcv.c (revision 5246) > > +++ opensm/osm_vl_arb_rcv.c (working copy) > > @@ -83,8 +83,6 @@ void > > osm_vla_rcv_destroy( > > IN osm_vla_rcv_t* const p_rcv ) > > { > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > @@ -136,8 +134,6 @@ osm_vla_rcv_process( > > ib_net64_t node_guid; > > uint8_t port_num, block_num; > > > > - CL_ASSERT( p_rcv ); > > - > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process ); > > > > CL_ASSERT( p_madw ); > > From Arkady.Kanevsky at netapp.com Wed Feb 1 06:44:49 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 1 Feb 2006 09:44:49 -0500 Subject: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: comments on Arlin and Caitlin's emails inline. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Monday, January 30, 2006 7:16 PM > To: Arlin Davis; Kanevsky, Arkady > Cc: Lentini, James; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] RE: [RFC] DAT 2.0 immediate > data proposal > > Arlin Davis wrote: > > Kanevsky, Arkady wrote: > > > >> Arlin, > >> I am not convinced we need a new recv for immediate data. > >> But what is needed is change in normative text in many places. > >> Recv, RDMA Write, DTO completion events, error behavior. > >> Sure you can define immed data in extension but it still effects > >> behavior of the normative part of the spec. > >> > >> > > How does it effect the normative part of the spec outside > of the DTO > > event extension? The post_recv behaves exactly the same. We will need a paragraph that size of the recv buffer shall accommodate immediate data if the recv may be matched with rdma_write_immed. There we can reference Provider attribute for how immed data is returned. Then in Advice to Consumer state how to generate transport independent recv and how it can be optimized based on Provider attr. > > > >> This is why my preference is to put it into the main spec. > >> > >> > > ok, with no new recv_immed call we do get a little closer. > > > >> The xfer_size is minor thing. We just need to define it > meaning with > >> respect to immed_data. Defining it either way is fine. > >> > >> Handling extra space on CQ can be handled by Provider. > >> We can add a new EVD attribute for the use for handling RDMA_write > >> with immed data and Provider can automatically add extra > space on CQ. > >> Provider is already responsible to handing user a single > completion. > >> SO it will only be used for error handling. > >> > >> > > sounds good. > > > >> Error handling takes maost of the new write up anyhow. 
> >> Regardless where it is done in the spec or in extension. > >> > >> Question on do we want to support Send with immed_data have to be > >> decided. Ditto remote RMR invalidation with new post(s) for > >> immed_data. > >> Just because IB supports all possible correlation under > one Send post > >> does not mean that uDAPL should follow that too. > >> > >> > > I would agree, strike them all except rdma_write_immed. The only ones which need to be discussed are Remote invalidate with rdma_write_immed and Local invalidate with rdma_write_immed. > > > > Can you give some idea how you would write up the normative > text for > > the transport independent receive that would accept immediate data? > > > > thanks, > > > > -arlin > > The data source: > posts an rdma write with immediate DTO, supplying > the RDMA Write data source and an immediate value. > > This is translated into one work request (if the > device supports write with immediate), or into > a RDMA Write followed by a RDMA Send (if it does > not). This should be in the Model Implication section. > > While successful completion of the RDMA Write will > be suppressed, the Consumer must still allow for the > extra space on the SendQ and the CQ. An IA attribute > will document how many work requests a write_with_immediate > will translate into. This belongs in the Model Implication section and also in the Usage section. > > The data sink: > post a recv (to EP or SRQ) with a four byte buffer. > > When it reaps the completion it needs to be ready > to see the data either in an "immediate" field in > the work completion, or in the buffer originally > specified in the recv DTO. > This is in the Usage section. > > A Provider MAY indicate that it supports immediate receives, > but on iWARP or any transport where this is not the default > optimized receive processing MUST be enabled by the user. > Otherwise, RFC compliance would require that a four byte > untagged message matched to a zero byte buffer was an error.
> Essentially the user is posting a receive operation that > names the four bytes in the Work completion as the buffer. > Ditto. Also it should reference Provider Attribute and not transport. > From mst at mellanox.co.il Wed Feb 1 06:45:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Feb 2006 16:45:50 +0200 Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER In-Reply-To: <1138803307.15119.63.camel@hal.voltaire.com> References: <1138803307.15119.63.camel@hal.voltaire.com> Message-ID: <20060201144550.GI31887@mellanox.co.il> Quoting r. Hal Rosenstock : > This looks fine for Linux to me but since this file is shared by > Windows, doesn't it need a conditionalization for those builds ? The simplest way I guess is to #define __func__ __FUNCTION__ Or move it to complib? Its up to you. > Also, will __FUNCTION__ work in all the Windows cases ? The latest DDK has it, as AFAIK so does VC7 and up. Do you care about VC6? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Wed Feb 1 06:52:28 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 1 Feb 2006 16:52:28 +0200 (IST) Subject: [openib-general] [PATCH] iser: always call rdma_disconnect/destroy_qp from sleepable context Message-ID: make sure that rdma_disconnect & rdma_destroy_qp etc are always called from sleepable context Signed-off-by: Or Gerlitz Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 5235) +++ ulp/iser/iser_conn.c (revision 5251) @@ -279,7 +279,7 @@ int iser_conn_sync_terminate(struct iscs case ISER_CONN_UP: case ISER_CONN_PENDING: atomic_set(&p_iser_conn->ib_conn->state, ISER_CONN_SYNC_TERM); - err = iser_disconnect(p_iser_conn->ib_conn); + err = rdma_disconnect(p_iser_conn->ib_conn->cma_id); if (err) iser_bug("Failed to disc.gracefully, conn: 0x%p\n", p_iser_conn); @@ -309,22 +309,24 @@ int iser_conn_sync_terminate(struct iscs } /** - * 
iser_conn_async_terminate - Triggers start of the disconn procedures - */ +* iser_conn_async_terminate - Triggers start of the disconn procedures +*/ int iser_conn_async_terminate(struct iser_conn *p_iser_conn) { - int err = 0; + int err; + /* if the state is UP it means that the conn is being async terminated * + * as of the iSCSI layer. We need to initiate a disconnection after * + * which we will notify the iSCSI layer */ if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) { + iser_err("Conn. 0x%p is being terminated asynchronously\n", p_iser_conn); atomic_set(&p_iser_conn->state, ISER_CONN_ASYNC_TERM); - iser_dbg("Conn.0x%p is being terminated asynchronously\n", - p_iser_conn); - err = iser_disconnect(p_iser_conn); + err = rdma_disconnect(p_iser_conn->cma_id); } else { - iser_err("called when in state %s\n", + iser_err("called when in state %s\n", iser_conn_get_state_name(p_iser_conn)); - err = -EPERM; - } + err = -EPERM; + } return err; } Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5235) +++ ulp/iser/iscsi_iser.h (revision 5251) @@ -265,6 +265,7 @@ struct iscsi_iser_conn atomic_t post_send_buf_count; wait_queue_head_t disconnect_wait_q; /* sync conn term uses this */ + struct work_struct comperror_work; /* sleepable contex for conn term */ int c_stage; /* connection state */ int stop_stage; /* conn_stop() flag: */ @@ -486,6 +487,7 @@ int iser_conn_async_terminate(struct ise int iser_conn_sync_terminate(struct iscsi_iser_conn *p_iser_conn); + int iser_complete_conn_termination(struct iscsi_iser_conn *p_iser_conn); void iser_adaptor_add_conn(struct iser_adaptor *p_iser_adaptor, @@ -506,8 +508,6 @@ int iser_dto_add_regd_buff(struct iser_d void iser_dto_free(struct iser_dto *p_dto); -int iser_dto_completion_error(struct iser_desc *p_desc); - void iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, struct iser_desc *tx_desc); @@ -636,4 +636,5 @@ void 
iser_unreg_mem(struct iser_mem_reg int iser_post_recv(struct iser_desc *p_rx_desc); int iser_start_send(struct iser_desc *p_tx_desc); +void iser_comp_error_worker(void *data); #endif Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 5235) +++ ulp/iser/iser_verbs.c (revision 5251) @@ -318,8 +318,6 @@ static void iser_route_handler(struct rd conn_param.retry_count = 7; conn_param.rnr_retry_count = 6; - /* #warning "the target must prefix iSER SID with OPENIB OUI prefix" */ - ret = rdma_connect(cma_id, &conn_param); if (ret) { iser_err("failure connecting: %d\n", ret); @@ -443,23 +441,6 @@ iser_connect(struct iser_conn *p_iser_co } /** - * iser_disconnect - disconnects from the target - * - * returns 0 on success, -1 on failure - */ -int iser_disconnect(struct iser_conn *p_iser_conn) -{ - int ret; - - ret = rdma_disconnect(p_iser_conn->cma_id); - if (ret) { - iser_err("rdma_disconnet failed: %d\n", ret); - return -1; - } - return 0; -} - -/** * iser_reg_phys_mem - Register physical memory * * returns: 0 on success, -1 on failure @@ -641,34 +622,43 @@ int iser_start_send(struct iser_desc *p_ return ret_val; } -static void iser_handle_comp_error(enum ib_wc_status status, - struct iser_desc *p_desc) +void iser_comp_error_worker(void *data) { - int ret_val; - struct iscsi_iser_conn *p_iser_conn = p_desc->dto.p_conn; - + struct iscsi_iser_conn *p_iser_conn = data; + int err; if(p_iser_conn == NULL) iser_bug("NULL p_desc->p_conn \n"); - /* Since the cma doesn't notify us on CONNECTION_EVENT_BROKEN * - * we need to initiate a disconn */ - if (status != IB_WC_WR_FLUSH_ERR) { - iser_dbg("Calling iser_conn_async_terminate\n"); - ret_val = iser_conn_async_terminate(p_iser_conn->ib_conn); - if (ret_val != 0) - iser_err("Failed to async term conn:0x%p\n", - p_iser_conn); - } - /* If this event is unsolicited this means that the conn is * - * being async terminated from the iSCSI layer's perspective. 
*/ - if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) { - atomic_set(&p_iser_conn->state, ISER_CONN_ASYNC_TERM); - iser_dbg("Conn. 0x%p is being terminated asynchronously\n", p_iser_conn); - } - /* Handle completion Error */ - ret_val = iser_dto_completion_error(p_desc); - if (ret_val && ret_val != -EAGAIN) - iser_err("Failed to handle ERROR DTO completion\n"); + if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) + err = iser_conn_async_terminate(p_iser_conn->ib_conn); + + err = iser_complete_conn_termination(p_iser_conn); + if (err && err != -EAGAIN) + iser_err("Failed to handle ERROR completion\n"); + + iser_err("iser_complete_conn_termination return %d\n", err); +} + +static void iser_handle_comp_error(struct iser_desc *p_desc) +{ + struct iser_dto *p_dto = &p_desc->dto; + struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; + + iser_dto_free(p_dto); + + if(p_desc->type == ISCSI_RX) { + kfree(p_desc->data); + kmem_cache_free(ig.desc_cache, p_desc); + atomic_dec(&p_iser_conn->post_recv_buf_count); + } else { /* type is TX control/command/dataout */ + if(p_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, p_desc); + atomic_dec(&p_iser_conn->post_send_buf_count); + } + + if( atomic_read(&p_iser_conn->post_recv_buf_count) == 0 && + atomic_read(&p_iser_conn->post_send_buf_count) == 0) + schedule_work(&p_iser_conn->comperror_work); } void iser_cq_tasklet_fn(unsigned long data) @@ -691,9 +681,10 @@ void iser_cq_tasklet_fn(unsigned long da iser_rcv_completion(p_desc, xfer_len); } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ iser_snd_completion(p_desc); - } else - /* #warning "we better do a context jump here" */ - iser_handle_comp_error(wc.status, p_desc); + } else { + iser_err("comp w. 
error op %d status %d\n",p_desc->type,wc.status); + iser_handle_comp_error(p_desc); + } } /* #warning "it is assumed here that arming CQ only once its empty would not" * "cause interrupts to be missed" */ Index: ulp/iser/iser_dto.c =================================================================== --- ulp/iser/iser_dto.c (revision 5235) +++ ulp/iser/iser_dto.c (revision 5251) @@ -87,40 +87,6 @@ void iser_dto_free(struct iser_dto *p_dt iser_dto_buffs_release(p_dto); } -/** - * iser_dto_completion_error - Handles error in DTO completion - * - * returns 0 on success, -1 on failure - */ -int iser_dto_completion_error(struct iser_desc *p_desc) -{ - struct iscsi_iser_conn *p_iser_conn; - int err; - struct iser_dto *p_dto = &p_desc->dto; - - p_iser_conn = p_dto->p_conn; - if (p_iser_conn == NULL) - iser_bug("NULL conn in p_dto:0x%p\n", p_dto); - - iser_dto_free(p_dto); - - if(p_desc->type == ISCSI_RX) { - kfree(p_desc->data); - kmem_cache_free(ig.desc_cache, p_desc); - atomic_dec(&p_iser_conn->post_recv_buf_count); - } - else { /* type is TX control/command/dataout */ - if(p_desc->type == ISCSI_TX_DATAOUT) - kmem_cache_free(ig.desc_cache, p_desc); - atomic_dec(&p_iser_conn->post_send_buf_count); - } - - err = iser_complete_conn_termination(p_iser_conn); - - return err; -} - -/* iser_dto_get_rx_pdu_data - gets received PDU descriptor & data from rx DTO */ /** * Creates a new send DTO descriptor, Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5235) +++ ulp/iser/iscsi_iser.c (revision 5251) @@ -1086,8 +1086,8 @@ static iscsi_connh_t iscsi_iser_conn_cre atomic_set(&conn->post_recv_buf_count, 0); atomic_set(&conn->post_send_buf_count, 0); init_waitqueue_head(&conn->disconnect_wait_q); - spin_lock_init(&conn->lock); + INIT_WORK(&conn->comperror_work, iser_comp_error_worker, conn); return iscsi_handle(conn); From mst at mellanox.co.il Wed Feb 1 06:47:21 2006 From: mst at mellanox.co.il 
(Michael S. Tsirkin) Date: Wed, 1 Feb 2006 16:47:21 +0200 Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER In-Reply-To: <1138803307.15119.63.camel@hal.voltaire.com> References: <1138803307.15119.63.camel@hal.voltaire.com> Message-ID: <20060201144721.GJ31887@mellanox.co.il> Quoting r. Hal Rosenstock : > > #define OSM_LOG_EXIT( OSM_LOG_PTR ) \ > > osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \ > > - "%s: ]\n", osm_log_func_name ); > > + "%s: ]\n", __func__ ) > > > > /****h* OpenSM/Log > > * NAME > > > This looks fine for Linux to me but since this file is shared by > Windows, doesn't it need a conditionalization for those builds ? > > Also, will __FUNCTION__ work in all the Windows cases ? > > -- Hal > BTW macros that include ';', like OSM_LOG_EXIT here, must be wrapped with do {} while (0), in any case. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Wed Feb 1 06:57:30 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 1 Feb 2006 16:57:30 +0200 (IST) Subject: [openib-general] on calling rdma_disconnect from non sleepable context Message-ID: Sean, I have now instrumented iser code to always call rdma_disconnect / rdma_destroy_qp etc from sleepable context. Before doing so I was getting this oops a few times. My interpretation was that cma_modify_qp_err is not supposed to get called when in_atomic is true, am I correct? Or.
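The deferral described above can be modelled in user space. This is only a minimal sketch, not the iSER code: conn_sketch, desc_type_sketch and handle_comp_error_sketch are hypothetical names, plain ints stand in for the kernel's atomic_t counters, and a flag stands in for schedule_work(). The "already requested" guard is an assumption added for the sketch; the posted patch does not have it. The point is that the completion path only does bookkeeping, and the teardown that may sleep is requested once both posted-buffer counts reach zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the per-connection state. */
struct conn_sketch {
	int post_recv_buf_count;  /* receive buffers still posted (atomic_t in the kernel) */
	int post_send_buf_count;  /* send buffers still posted (atomic_t in the kernel) */
	bool teardown_requested;  /* stands in for schedule_work() having been called */
};

enum desc_type_sketch { DESC_RX, DESC_TX };

/*
 * Models the non-sleepable completion path for a failed descriptor:
 * it only updates counters. Returns true when this call requested
 * the teardown that a sleepable worker would then carry out.
 */
static bool handle_comp_error_sketch(struct conn_sketch *c,
				     enum desc_type_sketch type)
{
	assert(c != NULL);

	if (type == DESC_RX)
		c->post_recv_buf_count--;
	else
		c->post_send_buf_count--;

	if (c->post_recv_buf_count == 0 && c->post_send_buf_count == 0 &&
	    !c->teardown_requested) {
		/* Kernel code would call schedule_work() at this point. */
		c->teardown_requested = true;
		return true;
	}
	return false;
}
```

With two receive buffers and one send buffer outstanding, only the third failed completion requests the teardown; nothing that can sleep runs in the completion path itself.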
Debug: sleeping function called from invalid context at mm/slab.c:2459 in_atomic():1, irqs_disabled():0 Call Trace: {__might_sleep+199} {kmem_cache_alloc+34} {:ib_mthca:mthca_alloc_mailbox+47} {:ib_mthca:mthca_modify_qp+850} {try_to_wake_up+1083} {:ib_core:ib_modify_qp+13} {:rdma_cm:cma_modify_qp_err+38} {:ib_mthca:mthca_free_err_wqe+44} {:ib_mthca:mthca_poll_cq+1697} {:rdma_cm:cma_comp+63} {:rdma_cm:rdma_disconnect+44} {:ib_iser:iser_disconnect+13} {:ib_iser:iser_conn_async_terminate+59} {:ib_iser:iser_cq_tasklet_fn+206} {tasklet_action+105} {__do_softirq+107} {call_softirq+31} {do_softirq+49} {do_IRQ+52} {ret_from_intr+0} {printk+141} {acpi_processor_idle+314} {cpu_idle+79} {start_secondary+996} scheduling while atomic: swapper/0x00000100/0 Call Trace: {schedule+125} {release_console_sem+393} {vprintk+782} {start_secondary+996} {:ib_iser:iser_cq_tasklet_fn+206} {:ib_iser:iser_cq_tasklet_fn+206} {wait_for_completion+179} {default_wake_function+0} {_spin_lock_irqsave+14} {default_wake_function+0} {__mod_timer+206} {:ib_mthca:mthca_cmd_wait+399} {:ib_mthca:mthca_cmd_box+69} {dump_stack+12} {:ib_mthca:mthca_cmd+40} {:ib_mthca:mthca_MODIFY_QP+266} {:ib_mthca:mthca_modify_qp+2421} {try_to_wake_up+1083} {:ib_core:ib_modify_qp+13} {:rdma_cm:cma_modify_qp_err+38} {:ib_mthca:mthca_free_err_wqe+44} {:ib_mthca:mthca_poll_cq+1697} {:rdma_cm:cma_comp+63} {:rdma_cm:rdma_disconnect+44} {:ib_iser:iser_disconnect+13} {:ib_iser:iser_conn_async_terminate+59} {:ib_iser:iser_cq_tasklet_fn+206} {tasklet_action+105} {__do_softirq+107} {call_softirq+31} {do_softirq+49} {do_IRQ+52} {ret_from_intr+0} {printk+141} {acpi_processor_idle+314} {cpu_idle+79} {start_secondary+996} From eitan at mellanox.co.il Wed Feb 1 07:05:13 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 1 Feb 2006 17:05:13 +0200 Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B68F@mtlexch01.mtl.com> Hi Michael, Hal, For some reason I 
did not get the last mails on the list. I know we have some mail issue here but can you please put me on the CC line personally until this is resolved? Thanks Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Wednesday, February 01, 2006 4:46 PM > To: Hal Rosenstock > Cc: openib-windows at openib.org; openib-general at openib.org > Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER > > Quoting r. Hal Rosenstock : > > This looks fine for Linux to me but since this file is shared by > > Windows, doesn't it need a conditionalization for those builds ? > > The simplest way I guess is to > #define __func__ __FUNCTION__ > > Or move it to complib? Its up to you. > > > Also, will __FUNCTION__ work in all the Windows cases ? > > The latest DDK has it, as AFAIK so does VC7 and up. Do you care about VC6? > > -- > Michael S. 
Tsirkin > Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Wed Feb 1 07:29:04 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 01 Feb 2006 17:29:04 +0200 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43DFA1F5.1050907@ichips.intel.com> References: <43D870AA.9080204@voltaire.com> <43D912E4.3020603@ichips.intel.com> <43DE1A0B.6030606@voltaire.com> <43DE5742.5030601@ichips.intel.com> <43DF0267.4080007@voltaire.com> <43DFA1F5.1050907@ichips.intel.com> Message-ID: <43E0D3C0.3000004@voltaire.com> Sean Hefty wrote: > The cache > is updated using an SA GET_TABLE request, which is more efficient than > sending separate SA GET requests for each path record. > Your assumption is correct. The implementation will contain copies of > all path records whose SGID is a local node GID. (Currently it contains > only a single path record per SGID/DGID, but that will be expanded.) Taking into account an invalidation window of 15 minutes, which you mentioned in one of the emails, and doing some math, I have come to the following: For a 1k node/port fabric the SM/SA needs to xmit a table of 1k paths to each local SA, where you can embed 3 paths in a MAD, which would take at least 350 MADs (330 RMPP segments + 20 ACKs). Since we have 1k nodes there are 350K MADs to xmit, and if we assume xmit is uniform over the 1k seconds (1000 seconds = 16 minutes & 40 seconds invalidation window) we require the -----SM to xmit at a constant rate of 350k/1k = 350 MADs/sec forever-----. And this is RMPP, so depending on the RMPP impl it would run into re-transmission of segments or the whole payload.
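As a rough check of the arithmetic above, here is a minimal sketch in C. The helper names are made up for illustration, and the inputs (1k nodes, 3 paths per MAD, about 350 MADs per table including ACKs, a 1000 second window) are taken straight from the estimate above. Note that the exact ceiling of 1000/3 is 334 segments; the 330 + 20 figure above is a rounding.

```c
#include <assert.h>

/* RMPP segments needed to carry `paths` path records at `paths_per_mad`
 * records per MAD, rounded up. Hypothetical helper for this estimate only. */
static unsigned long rmpp_segments_sketch(unsigned long paths,
					  unsigned long paths_per_mad)
{
	assert(paths_per_mad > 0);
	return (paths + paths_per_mad - 1) / paths_per_mad;
}

/* Average MADs per second the SM has to send so that every node gets
 * one full table per invalidation window. */
static unsigned long sm_mads_per_sec_sketch(unsigned long nodes,
					    unsigned long mads_per_table,
					    unsigned long window_sec)
{
	assert(window_sec > 0);
	return nodes * mads_per_table / window_sec;
}
```

The sketch assumes the load is spread evenly over the window and ignores retransmissions, so the sustained figure it gives should be read as a lower bound.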
And each such table takes 90K (350*256) RAM so the SM needs to allow for up to 90MB of RAM to hold all those tables. Aren't we creating a monster here??? If this is an SA replica that should work at scale from day one, let's call it that and see how to get there. > I view MPI as one of the primary reasons for having a cache. Waiting > for a > failed lookup to create the initial cache would delay the startup time > for apps wanting all-to-all connection establishment. In this case, we > also get the side effect that the SA receives GET_TABLE requests from > every node at roughly the same time. Talking MPI, here are a few points that seem to me somehow unaddressed in the all-to-all cache design: + neither MVAPICH nor OpenMPI uses path queries + OpenMPI opens its connections "on demand", that is, only if rank I attempts to send a message to rank J does I connect to J + even MPIs that connect all-to-all in an N-rank job would do only n(n-1)/2 path queries, so the aggregate load on the SA is half what the all-to-all caching scheme generates Or. From ogerlitz at voltaire.com Wed Feb 1 07:44:30 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 01 Feb 2006 17:44:30 +0200 Subject: [openib-general] [PATCH 4/4] SA path record caching In-Reply-To: References: Message-ID: <43E0D75E.70203@voltaire.com> Sean Hefty wrote: > Modify the CMA to use the local SA database for path record lookups. I recall that one of the design goals was to make the cache optional, in the sense that the openib stack can provide the same functionality without the cache present; in other words, the user can easily disable the cache and work as usual. Since the CMA depends on the symbols exported by the local SA, this is not the case with this implementation.
How about either having a mod param to the local SA practically telling it to do nothing (ugly) or the local SA doing symbol_put to the cache query function and the CMA attempting to symbol_get it, if the symbol exists, then the CMA is first trying the cache before querying the SA, or whatever other better idea someone can come up with? Or. From sean.hefty at intel.com Wed Feb 1 09:13:34 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 09:13:34 -0800 Subject: [openib-general] on calling rdma_disconnect from non sleepablecontext In-Reply-To: Message-ID: >I have now instrumented iser code to always call rdma_disconnect / >rdma_destory_qp etc from sleepable context. Before doing so i was >getting >this oops few times. My interpertation was that cma_modify_qp_err is >not supposted to get called when in_atomic is true, am i correct? It looks like this from the backtrace, but this is dependent on the lower level driver. - Sean From caitlinb at broadcom.com Wed Feb 1 09:19:56 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 1 Feb 2006 09:19:56 -0800 Subject: [openib-general] on calling rdma_disconnect from non sleepablecontext Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D3D5@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >> I have now instrumented iser code to always call rdma_disconnect / >> rdma_destory_qp etc from sleepable context. Before doing so i was >> getting this oops few times. My interpertation was that >> cma_modify_qp_err is not supposted to get called when in_atomic is >> true, am i correct? > > It looks like this from the backtrace, but this is dependent on the > lower level driver. > ULP code such as iSER should not have to read driver-specific code to determine if their use of the verbs is correct. Either use of cma_modify_qp_err is legal while in_atomic or it is not. If it is legal, the fact that a given driver oops is a bug in that driver. 
If it is not legal then the fact that a given driver does not cause an oops does not make it legal. Generally, I do not believe that QP state modifying calls should be legal in restricted contexts. Whenever there is a requirement that this be supported it should be explicitly documented because it creates a special requirement for those implementing drivers. From rdreier at cisco.com Wed Feb 1 09:20:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 01 Feb 2006 09:20:29 -0800 Subject: [openib-general] on calling rdma_disconnect from non sleepablecontext In-Reply-To: (Sean Hefty's message of "Wed, 1 Feb 2006 09:13:34 -0800") References: Message-ID: > I have now instrumented iser code to always call rdma_disconnect / > rdma_destory_qp etc from sleepable context. Before doing so i was > getting this oops few times. My interpertation was that > cma_modify_qp_err is not supposted to get called when in_atomic is > true, am i correct? Yes, the modify QP operation might sleep. For example, on Mellanox hardware, modifying a QP requires a firmware command, which allocates a mailbox with GFP_KERNEL and also sleeps until the command completes. I tried to document these sorts of rules in Documentation/infiniband/core_locking.txt - R. From sean.hefty at intel.com Wed Feb 1 09:35:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 09:35:01 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43E0D3C0.3000004@voltaire.com> Message-ID: >Aren't we creating a monster here??? if this is SA replica which should >work for scale from day one, lets call it this way and see how to reach >there. The cache update window is configurable. What we don't know is how often the SA would be queried to establish connections without a local cache present. Based on information from SilverStorm, the cache should work well in practice.
What I think we'd like is a userspace cache hierarchy/distributed SA; however, the time to develop these does not meet any of the Path Forward schedules. Having a mechanism where clients could ask if there have been any updates would also work, but I didn't see a way to do this without modifying the SA node. >+ neither MVAPICH nor OpenMPI are using path query The national labs want all path records for their routing algorithms. I believe that the problems here were API issues that make connecting difficult. As a result, most applications just hard-coded everything. >+ OpenMPI is opening its connections "per demand" that is only if rank >I >attempts to send a message to rank J then I connects to J > >+ even MPIs that connect all-to-all in an N ranks JOB would do only >n(n-1)/2 path queries, so the load aggregated load on the SA is half >what the all-to-all caching scheme is generating It would be better to issue a single query for all path records, and discard those not needed, than issue separate path records queries. This is what the cache does. The difference is 1000 queries, versus 500,000 queries. The total number of MADs generated by the SA is still lower using a single query to return all path records. - Sean From sean.hefty at intel.com Wed Feb 1 09:54:15 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 09:54:15 -0800 Subject: [openib-general] [PATCH 4/4] SA path record caching In-Reply-To: <43E0D75E.70203@voltaire.com> Message-ID: >How about either having a mod param to the local SA practically telling >it to do nothing (ugly) or the local SA doing symbol_put to the cache >query function and the CMA attempting to symbol_get it, if the symbol >exists, then the CMA is first trying the cache before querying the SA, >or whatever other better idea someone can come up with? This sounds possible. Note that the local SA module will be there to support multicast join/leave operations. 
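The two query counts compared above can be checked with a small sketch. The function names are hypothetical and the model is deliberately crude: it counts SA queries only, not the MADs each query generates, and it assumes every pair of ranks connects exactly once.

```c
#include <assert.h>

/* Separate path record queries when each of n ranks connects to every
 * other rank once: n(n-1)/2. Hypothetical helper for illustration. */
static unsigned long pairwise_queries_sketch(unsigned long n)
{
	assert(n > 0);
	return n * (n - 1) / 2;
}

/* One table query per node when each node fetches all of its path
 * records in a single request. */
static unsigned long table_queries_sketch(unsigned long n)
{
	return n;
}
```

For n = 1000 this gives 499,500 pairwise queries against 1000 table queries, which matches the roughly 500,000 versus 1000 quoted above.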
- Sean From jice at pantasys.com Wed Feb 1 10:30:18 2006 From: jice at pantasys.com (Jean-Christophe Hugly) Date: Wed, 01 Feb 2006 10:30:18 -0800 Subject: [openib-general] Re: libibverbs.la seems to have a non-working default. In-Reply-To: <20060201132448.GD31887@mellanox.co.il> References: <43E0CF9B.3080509@clustervision.com> <20060201132448.GD31887@mellanox.co.il> Message-ID: <1138818618.22996.55.camel@jhugly.pantasys.com> On Wed, 2006-02-01 at 15:24 +0200, Michael S. Tsirkin wrote: > Quoting r. Guido Passet : > > Subject: libibverbs.la seems to have a non-working default. > > > > Hi, > > > > I ran into a problem compiling libibcm and managed to trace it back to > > libibverbs. It seems that libibverbs.la has a static link for > > libsysfs.la to /lib64 on a 64bit host and to /lib on a 32bits host. > > > > However, on my SUSE10 system the libtool archive is located in > > /usr/lib/libsysfs.la.. > > > > dependency_libs=' -L/usr/local/my-favorite-installdir/OpenIB/lib > > /lib64/libsysfs.la -lpthread -ldl' > > > > For now i have a workaround this, but i am not sure if this is either a > > feature or a bug ;) > > > > Best regards, > > -- > > Guido Passet Email: guido.passet at clustervision.com > > Looks like autotools problem at your end: libibverbs.la is a generated file. I have encountered the same issue (suse10 as well). For now, I also use an ugly work-around (basically a script goes and seds libibverbs.la before compiling the rest). The file may be generated by the autoconf/libtool etc. machinery, but I do not know how much the contents depend on the make files. The devel libsysfs.so and .la are in /usr/lib[64], while the run-time is in /lib[64]; that's pretty usual, but there seems to be some assumption somewhere that the .la is next to the run-time lib. I wish I had a week to spare understanding the autoconf/libtool mess. If someone has already made that cognitive investment, and can figure out where the mistaken assumption is, that'd be nice.
-- Jean-Christophe Hugly PANTA From sean.hefty at intel.com Wed Feb 1 12:03:06 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 12:03:06 -0800 Subject: [openib-general] [PATCH 0/5] Infiniband: connection abstraction Message-ID: Here's an updated version of these patches based on feedback. (The license did not change and continues to match that of the other Infiniband code.) Please consider for inclusion in 2.6.17. The following set of patches defines a connection abstraction for Infiniband and other RDMA devices, and serves several purposes: * It implements a connection protocol over Infiniband based on IP addressing. This greatly simplifies clients wishing to establish connections over Infiniband. * It defines a connection abstraction that works over multiple RDMA devices. The submitted implementation targets Infiniband, but has been tested over other RDMA devices as well. * It handles RDMA device insertion and removal on behalf of its clients. The changes have been broken into 5 separate patches. The basic purpose of each patch is: 1. Provide common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. 2. Extend the Infiniband CM to include private data comparisons as part of its connection request matching process. 3. Provide an address translation service that maps IP addresses to Infiniband addresses (GIDs). This patch touches outside of the Infiniband core, so I'm including the netdev mailing list. 4. Implement the kernel mode RDMA connection management agent. 5. Implement the userspace RDMA connection management agent kernel support module. Please copy the openib-general mailing list on any replies. 
Thanks, Sean From sean.hefty at intel.com Wed Feb 1 12:07:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 12:07:25 -0800 Subject: [openib-general] [PATCH 1/5] Infiniband: connection abstraction In-Reply-To: Message-ID: The following patch provides common handling for marshalling data between Userspace clients and kernel mode Infiniband drivers. Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 10:25:27.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 15:34:15.000000000 -0800 @@ -16,4 +16,5 @@ ib_umad-y := user_mad.o ib_ucm-y := ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 15:34:15.000000000 -0800 @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
  *
- * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $
+ * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include
 #include
@@ -48,6 +48,7 @@
 #include
 #include
+#include
 MODULE_AUTHOR("Libor Michalek");
 MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access");
@@ -203,36 +204,6 @@ error:
 	return NULL;
 }
-static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath,
-				  struct ib_sa_path_rec *kpath)
-{
-	if (!kpath || !upath)
-		return;
-
-	memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid);
-	memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid);
-
-	upath->dlid = kpath->dlid;
-	upath->slid = kpath->slid;
-	upath->raw_traffic = kpath->raw_traffic;
-	upath->flow_label = kpath->flow_label;
-	upath->hop_limit = kpath->hop_limit;
-	upath->traffic_class = kpath->traffic_class;
-	upath->reversible = kpath->reversible;
-	upath->numb_path = kpath->numb_path;
-	upath->pkey = kpath->pkey;
-	upath->sl = kpath->sl;
-	upath->mtu_selector = kpath->mtu_selector;
-	upath->mtu = kpath->mtu;
-	upath->rate_selector = kpath->rate_selector;
-	upath->rate = kpath->rate;
-	upath->packet_life_time = kpath->packet_life_time;
-	upath->preference = kpath->preference;
-
-	upath->packet_life_time_selector =
-		kpath->packet_life_time_selector;
-}
-
 static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq,
 				 struct ib_cm_req_event_param *kreq)
 {
@@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct
 	ureq->srq = kreq->srq;
 	ureq->port = kreq->port;
-	ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path);
-	ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path);
+	ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path);
+	if (kreq->alternate_path)
+		ib_copy_path_rec_to_user(&ureq->alternate_path,
+					 kreq->alternate_path);
 }
 static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep,
@@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i
 		info = evt->param.rej_rcvd.ari;
 		break;
 	case IB_CM_LAP_RECEIVED:
-		ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path,
-				      evt->param.lap_rcvd.alternate_path);
+		ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path,
+					 evt->param.lap_rcvd.alternate_path);
 		uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE;
 		uvt->resp.present = IB_UCM_PRES_ALTERNATE;
 		break;
@@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_
 	return result;
 }
-static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr,
-				struct ib_ah_attr *src_attr)
-{
-	memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw,
-	       sizeof src_attr->grh.dgid);
-	dest_attr->grh_flow_label = src_attr->grh.flow_label;
-	dest_attr->grh_sgid_index = src_attr->grh.sgid_index;
-	dest_attr->grh_hop_limit = src_attr->grh.hop_limit;
-	dest_attr->grh_traffic_class = src_attr->grh.traffic_class;
-
-	dest_attr->dlid = src_attr->dlid;
-	dest_attr->sl = src_attr->sl;
-	dest_attr->src_path_bits = src_attr->src_path_bits;
-	dest_attr->static_rate = src_attr->static_rate;
-	dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH);
-	dest_attr->port_num = src_attr->port_num;
-}
-
-static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr,
-				struct ib_qp_attr *src_attr)
-{
-	dest_attr->cur_qp_state = src_attr->cur_qp_state;
-	dest_attr->path_mtu = src_attr->path_mtu;
-	dest_attr->path_mig_state = src_attr->path_mig_state;
-	dest_attr->qkey = src_attr->qkey;
-	dest_attr->rq_psn = src_attr->rq_psn;
-	dest_attr->sq_psn = src_attr->sq_psn;
-	dest_attr->dest_qp_num = src_attr->dest_qp_num;
-	dest_attr->qp_access_flags = src_attr->qp_access_flags;
-
-	dest_attr->max_send_wr = src_attr->cap.max_send_wr;
-	dest_attr->max_recv_wr = src_attr->cap.max_recv_wr;
-	dest_attr->max_send_sge = src_attr->cap.max_send_sge;
-	dest_attr->max_recv_sge = src_attr->cap.max_recv_sge;
-	dest_attr->max_inline_data = src_attr->cap.max_inline_data;
-
-	ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr);
-	ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr);
-
-	dest_attr->pkey_index = src_attr->pkey_index;
-	dest_attr->alt_pkey_index = src_attr->alt_pkey_index;
-	dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify;
-	dest_attr->sq_draining = src_attr->sq_draining;
-	dest_attr->max_rd_atomic = src_attr->max_rd_atomic;
-	dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic;
-	dest_attr->min_rnr_timer = src_attr->min_rnr_timer;
-	dest_attr->port_num = src_attr->port_num;
-	dest_attr->timeout = src_attr->timeout;
-	dest_attr->retry_cnt = src_attr->retry_cnt;
-	dest_attr->rnr_retry = src_attr->rnr_retry;
-	dest_attr->alt_port_num = src_attr->alt_port_num;
-	dest_attr->alt_timeout = src_attr->alt_timeout;
-}
-
 static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file,
 				   const char __user *inbuf,
 				   int in_len, int out_len)
 {
-	struct ib_ucm_init_qp_attr_resp resp;
+	struct ib_uverbs_qp_attr resp;
 	struct ib_ucm_init_qp_attr cmd;
 	struct ib_ucm_context *ctx;
 	struct ib_qp_attr qp_attr;
@@ -716,7 +635,7 @@ static ssize_t ib_ucm_init_qp_attr(struc
 	if (result)
 		goto out;
-	ib_ucm_copy_qp_attr(&resp, &qp_attr);
+	ib_copy_qp_attr_to_user(&resp, &qp_attr);
 	if (copy_to_user((void __user *)(unsigned long)cmd.response,
 			 &resp, sizeof(resp)))
@@ -791,7 +710,7 @@ static int ib_ucm_alloc_data(const void
 static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src)
 {
-	struct ib_ucm_path_rec ucm_path;
+	struct ib_user_path_rec upath;
 	struct ib_sa_path_rec *sa_path;
 	*path = NULL;
@@ -803,36 +722,14 @@ static int ib_ucm_path_get(struct ib_sa_
 	if (!sa_path)
 		return -ENOMEM;
-	if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src,
-			   sizeof(ucm_path))) {
+	if (copy_from_user(&upath, (void __user *)(unsigned long)src,
+			   sizeof(upath))) {
 		kfree(sa_path);
 		return -EFAULT;
 	}
-	memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof sa_path->dgid);
-	memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof sa_path->sgid);
-
-	sa_path->dlid = ucm_path.dlid;
-	sa_path->slid = ucm_path.slid;
-	sa_path->raw_traffic = ucm_path.raw_traffic;
-	sa_path->flow_label = ucm_path.flow_label;
-	sa_path->hop_limit = ucm_path.hop_limit;
-	sa_path->traffic_class = ucm_path.traffic_class;
-	sa_path->reversible = ucm_path.reversible;
-	sa_path->numb_path = ucm_path.numb_path;
-	sa_path->pkey = ucm_path.pkey;
-	sa_path->sl = ucm_path.sl;
-	sa_path->mtu_selector = ucm_path.mtu_selector;
-	sa_path->mtu = ucm_path.mtu;
-	sa_path->rate_selector = ucm_path.rate_selector;
-	sa_path->rate = ucm_path.rate;
-	sa_path->packet_life_time = ucm_path.packet_life_time;
-	sa_path->preference = ucm_path.preference;
-
-	sa_path->packet_life_time_selector =
-		ucm_path.packet_life_time_selector;
-
+	ib_copy_path_rec_from_user(sa_path, &upath);
 	*path = sa_path;
 	return 0;
 }
@@ -1243,8 +1140,10 @@ static unsigned int ib_ucm_poll(struct f
 	poll_wait(filp, &file->poll_wait, wait);
+	down(&file->mutex);
 	if (!list_empty(&file->events))
 		mask = POLLIN | POLLRDNORM;
+	up(&file->mutex);
 	return mask;
 }
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/uverbs_marshall.c linux-2.6.ib/drivers/infiniband/core/uverbs_marshall.c
--- linux-2.6.git/drivers/infiniband/core/uverbs_marshall.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/uverbs_marshall.c 2006-01-16 15:34:15.000000000 -0800
@@ -0,0 +1,138 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include
+
+static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst,
+				    struct ib_ah_attr *src)
+{
+	memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof src->grh.dgid);
+	dst->grh.flow_label = src->grh.flow_label;
+	dst->grh.sgid_index = src->grh.sgid_index;
+	dst->grh.hop_limit = src->grh.hop_limit;
+	dst->grh.traffic_class = src->grh.traffic_class;
+	dst->dlid = src->dlid;
+	dst->sl = src->sl;
+	dst->src_path_bits = src->src_path_bits;
+	dst->static_rate = src->static_rate;
+	dst->is_global = src->ah_flags & IB_AH_GRH ? 1 : 0;
+	dst->port_num = src->port_num;
+}
+
+void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
+			     struct ib_qp_attr *src)
+{
+	dst->cur_qp_state = src->cur_qp_state;
+	dst->path_mtu = src->path_mtu;
+	dst->path_mig_state = src->path_mig_state;
+	dst->qkey = src->qkey;
+	dst->rq_psn = src->rq_psn;
+	dst->sq_psn = src->sq_psn;
+	dst->dest_qp_num = src->dest_qp_num;
+	dst->qp_access_flags = src->qp_access_flags;
+
+	dst->max_send_wr = src->cap.max_send_wr;
+	dst->max_recv_wr = src->cap.max_recv_wr;
+	dst->max_send_sge = src->cap.max_send_sge;
+	dst->max_recv_sge = src->cap.max_recv_sge;
+	dst->max_inline_data = src->cap.max_inline_data;
+
+	ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr);
+	ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr);
+
+	dst->pkey_index = src->pkey_index;
+	dst->alt_pkey_index = src->alt_pkey_index;
+	dst->en_sqd_async_notify = src->en_sqd_async_notify;
+	dst->sq_draining = src->sq_draining;
+	dst->max_rd_atomic = src->max_rd_atomic;
+	dst->max_dest_rd_atomic = src->max_dest_rd_atomic;
+	dst->min_rnr_timer = src->min_rnr_timer;
+	dst->port_num = src->port_num;
+	dst->timeout = src->timeout;
+	dst->retry_cnt = src->retry_cnt;
+	dst->rnr_retry = src->rnr_retry;
+	dst->alt_port_num = src->alt_port_num;
+	dst->alt_timeout = src->alt_timeout;
+}
+EXPORT_SYMBOL(ib_copy_qp_attr_to_user);
+
+void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
+			      struct ib_sa_path_rec *src)
+{
+	memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid);
+	memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid);
+
+	dst->dlid = src->dlid;
+	dst->slid = src->slid;
+	dst->raw_traffic = src->raw_traffic;
+	dst->flow_label = src->flow_label;
+	dst->hop_limit = src->hop_limit;
+	dst->traffic_class = src->traffic_class;
+	dst->reversible = src->reversible;
+	dst->numb_path = src->numb_path;
+	dst->pkey = src->pkey;
+	dst->sl = src->sl;
+	dst->mtu_selector = src->mtu_selector;
+	dst->mtu = src->mtu;
+	dst->rate_selector = src->rate_selector;
+	dst->rate = src->rate;
+	dst->packet_life_time = src->packet_life_time;
+	dst->preference = src->preference;
+	dst->packet_life_time_selector = src->packet_life_time_selector;
+}
+EXPORT_SYMBOL(ib_copy_path_rec_to_user);
+
+void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
+				struct ib_user_path_rec *src)
+{
+	memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid);
+	memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid);
+
+	dst->dlid = src->dlid;
+	dst->slid = src->slid;
+	dst->raw_traffic = src->raw_traffic;
+	dst->flow_label = src->flow_label;
+	dst->hop_limit = src->hop_limit;
+	dst->traffic_class = src->traffic_class;
+	dst->reversible = src->reversible;
+	dst->numb_path = src->numb_path;
+	dst->pkey = src->pkey;
+	dst->sl = src->sl;
+	dst->mtu_selector = src->mtu_selector;
+	dst->mtu = src->mtu;
+	dst->rate_selector = src->rate_selector;
+	dst->rate = src->rate;
+	dst->packet_life_time = src->packet_life_time;
+	dst->preference = src->preference;
+	dst->packet_life_time_selector = src->packet_life_time_selector;
+}
+EXPORT_SYMBOL(ib_copy_path_rec_from_user);
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_marshall.h linux-2.6.ib/include/rdma/ib_marshall.h
--- linux-2.6.git/include/rdma/ib_marshall.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_marshall.h 2006-01-16 15:34:15.000000000 -0800
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(IB_USER_MARSHALL_H)
+#define IB_USER_MARSHALL_H
+
+#include
+#include
+#include
+#include
+
+void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
+			     struct ib_qp_attr *src);
+
+void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
+			      struct ib_sa_path_rec *src);
+
+void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
+				struct ib_user_path_rec *src);
+
+#endif /* IB_USER_MARSHALL_H */
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_cm.h linux-2.6.ib/include/rdma/ib_user_cm.h
--- linux-2.6.git/include/rdma/ib_user_cm.h 2006-01-16 10:26:47.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_user_cm.h 2006-01-16 15:34:15.000000000 -0800
@@ -30,13 +30,13 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ib_user_cm.h 2576 2005-06-09 17:00:30Z libor $
+ * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $
  */
 #ifndef IB_USER_CM_H
 #define IB_USER_CM_H
-#include
+#include
 #define IB_USER_CM_ABI_VERSION 4
@@ -110,58 +110,6 @@ struct ib_ucm_init_qp_attr {
 	__u32 qp_state;
 };
-struct ib_ucm_ah_attr {
-	__u8 grh_dgid[16];
-	__u32 grh_flow_label;
-	__u16 dlid;
-	__u16 reserved;
-	__u8 grh_sgid_index;
-	__u8 grh_hop_limit;
-	__u8 grh_traffic_class;
-	__u8 sl;
-	__u8 src_path_bits;
-	__u8 static_rate;
-	__u8 is_global;
-	__u8 port_num;
-};
-
-struct ib_ucm_init_qp_attr_resp {
-	__u32 qp_attr_mask;
-	__u32 qp_state;
-	__u32 cur_qp_state;
-	__u32 path_mtu;
-	__u32 path_mig_state;
-	__u32 qkey;
-	__u32 rq_psn;
-	__u32 sq_psn;
-	__u32 dest_qp_num;
-	__u32 qp_access_flags;
-
-	struct ib_ucm_ah_attr ah_attr;
-	struct ib_ucm_ah_attr alt_ah_attr;
-
-	/* ib_qp_cap */
-	__u32 max_send_wr;
-	__u32 max_recv_wr;
-	__u32 max_send_sge;
-	__u32 max_recv_sge;
-	__u32 max_inline_data;
-
-	__u16 pkey_index;
-	__u16 alt_pkey_index;
-	__u8 en_sqd_async_notify;
-	__u8 sq_draining;
-	__u8 max_rd_atomic;
-	__u8 max_dest_rd_atomic;
-	__u8 min_rnr_timer;
-	__u8 port_num;
-	__u8 timeout;
-	__u8 retry_cnt;
-	__u8 rnr_retry;
-	__u8 alt_port_num;
-	__u8 alt_timeout;
-};
-
 struct ib_ucm_listen {
 	__be64 service_id;
 	__be64 service_mask;
@@ -180,28 +128,6 @@ struct ib_ucm_private_data {
 	__u8 reserved[3];
 };
-struct ib_ucm_path_rec {
-	__u8 dgid[16];
-	__u8 sgid[16];
-	__be16 dlid;
-	__be16 slid;
-	__u32 raw_traffic;
-	__be32 flow_label;
-	__u32 reversible;
-	__u32 mtu;
-	__be16 pkey;
-	__u8 hop_limit;
-	__u8 traffic_class;
-	__u8 numb_path;
-	__u8 sl;
-	__u8 mtu_selector;
-	__u8 rate_selector;
-	__u8 rate;
-	__u8 packet_life_time_selector;
-	__u8 packet_life_time;
-	__u8 preference;
-};
-
 struct ib_ucm_req {
 	__u32 id;
 	__u32 qpn;
@@ -304,8 +230,8 @@ struct ib_ucm_event_get {
 };
 struct ib_ucm_req_event_resp {
-	struct ib_ucm_path_rec primary_path;
-	struct ib_ucm_path_rec alternate_path;
+	struct ib_user_path_rec primary_path;
+	struct ib_user_path_rec alternate_path;
 	__be64 remote_ca_guid;
 	__u32 remote_qkey;
 	__u32 remote_qpn;
@@ -349,7 +275,7 @@ struct ib_ucm_mra_event_resp {
 };
 struct ib_ucm_lap_event_resp {
-	struct ib_ucm_path_rec path;
+	struct ib_user_path_rec path;
 };
 struct ib_ucm_apr_event_resp {
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_sa.h linux-2.6.ib/include/rdma/ib_user_sa.h
--- linux-2.6.git/include/rdma/ib_user_sa.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_user_sa.h 2006-01-16 15:34:15.000000000 -0800
@@ -0,0 +1,60 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef IB_USER_SA_H
+#define IB_USER_SA_H
+
+#include
+
+struct ib_user_path_rec {
+	__u8 dgid[16];
+	__u8 sgid[16];
+	__be16 dlid;
+	__be16 slid;
+	__u32 raw_traffic;
+	__be32 flow_label;
+	__u32 reversible;
+	__u32 mtu;
+	__be16 pkey;
+	__u8 hop_limit;
+	__u8 traffic_class;
+	__u8 numb_path;
+	__u8 sl;
+	__u8 mtu_selector;
+	__u8 rate_selector;
+	__u8 rate;
+	__u8 packet_life_time_selector;
+	__u8 packet_life_time;
+	__u8 preference;
+};
+
+#endif /* IB_USER_SA_H */
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_user_verbs.h linux-2.6.ib/include/rdma/ib_user_verbs.h
--- linux-2.6.git/include/rdma/ib_user_verbs.h 2006-01-16 10:26:47.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_user_verbs.h 2006-01-16 15:34:15.000000000 -0800
@@ -31,7 +31,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ib_user_verbs.h 2708 2005-06-24 17:27:21Z roland $
+ * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $
  */
 #ifndef IB_USER_VERBS_H
@@ -311,6 +311,64 @@ struct ib_uverbs_destroy_cq_resp {
 	__u32 async_events_reported;
 };
+struct ib_uverbs_global_route {
+	__u8 dgid[16];
+	__u32 flow_label;
+	__u8 sgid_index;
+	__u8 hop_limit;
+	__u8 traffic_class;
+	__u8 reserved;
+};
+
+struct ib_uverbs_ah_attr {
+	struct ib_uverbs_global_route grh;
+	__u16 dlid;
+	__u8 sl;
+	__u8 src_path_bits;
+	__u8 static_rate;
+	__u8 is_global;
+	__u8 port_num;
+	__u8 reserved;
+};
+
+struct ib_uverbs_qp_attr {
+	__u32 qp_attr_mask;
+	__u32 qp_state;
+	__u32 cur_qp_state;
+	__u32 path_mtu;
+	__u32 path_mig_state;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+
+	struct ib_uverbs_ah_attr ah_attr;
+	struct ib_uverbs_ah_attr alt_ah_attr;
+
+	/* ib_qp_cap */
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8 en_sqd_async_notify;
+	__u8 sq_draining;
+	__u8 max_rd_atomic;
+	__u8 max_dest_rd_atomic;
+	__u8 min_rnr_timer;
+	__u8 port_num;
+	__u8 timeout;
+	__u8 retry_cnt;
+	__u8 rnr_retry;
+	__u8 alt_port_num;
+	__u8 alt_timeout;
+	__u8 reserved[5];
+};
+
 struct ib_uverbs_create_qp {
 	__u64 response;
 	__u64 user_handle;
@@ -487,26 +545,6 @@ struct ib_uverbs_post_srq_recv_resp {
 	__u32 bad_wr;
 };
-struct ib_uverbs_global_route {
-	__u8 dgid[16];
-	__u32 flow_label;
-	__u8 sgid_index;
-	__u8 hop_limit;
-	__u8 traffic_class;
-	__u8 reserved;
-};
-
-struct ib_uverbs_ah_attr {
-	struct ib_uverbs_global_route grh;
-	__u16 dlid;
-	__u8 sl;
-	__u8 src_path_bits;
-	__u8 static_rate;
-	__u8 is_global;
-	__u8 port_num;
-	__u8 reserved;
-};
-
 struct ib_uverbs_create_ah {
 	__u64 response;
 	__u64 user_handle;

From sean.hefty at intel.com Wed Feb 1 12:10:26 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 1 Feb 2006 12:10:26 -0800
Subject: [openib-general] [PATCH 2/5] Infiniband: connection abstraction
In-Reply-To:
Message-ID:

The following patch extends the matching of connection requests to listens
in the InfiniBand CM to include private data.

Signed-off-by: Sean Hefty

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.000000000 -0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include
 #include
@@ -130,6 +130,7 @@ struct cm_id_private {
 	/* todo: use alternate port on send failure */
 	struct cm_av av;
 	struct cm_av alt_av;
+	struct ib_cm_compare_data *compare_data;
 	void *private_data;
 	__be64 tid;
@@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire
 	return cm_id_priv;
 }
+static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask)
+{
+	int i;
+
+	for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++)
+		((unsigned long *) dst)[i] = ((unsigned long *) src)[i] &
+					     ((unsigned long *) mask)[i];
+}
+
+static int cm_compare_data(struct ib_cm_compare_data *src_data,
+			   struct ib_cm_compare_data *dst_data)
+{
+	u8 src[IB_CM_COMPARE_SIZE];
+	u8 dst[IB_CM_COMPARE_SIZE];
+
+	if (!src_data || !dst_data)
+		return 0;
+
+	cm_mask_copy(src, src_data->data, dst_data->mask);
+	cm_mask_copy(dst, dst_data->data, src_data->mask);
+	return memcmp(src, dst, IB_CM_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+				   struct ib_cm_compare_data *dst_data)
+{
+	u8 src[IB_CM_COMPARE_SIZE];
+
+	if (!dst_data)
+		return 0;
+
+	cm_mask_copy(src, private_data, dst_data->mask);
+	return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE);
+}
+
 static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 {
 	struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
 	struct cm_id_private *cur_cm_id_priv;
 	__be64 service_id = cm_id_priv->id.service_id;
 	__be64 service_mask = cm_id_priv->id.service_mask;
+	int data_cmp;
 	while (*link) {
 		parent = *link;
 		cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
 					  service_node);
+		data_cmp = cm_compare_data(cm_id_priv->compare_data,
+					   cur_cm_id_priv->compare_data);
 		if ((cur_cm_id_priv->id.service_mask & service_id) ==
 		    (service_mask & cur_cm_id_priv->id.service_id) &&
-		    (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+		    (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+		    !data_cmp)
 			return cur_cm_id_priv;
 		if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
 			link = &(*link)->rb_right;
 		else if (service_id < cur_cm_id_priv->id.service_id)
 			link = &(*link)->rb_left;
+		else if (service_id > cur_cm_id_priv->id.service_id)
+			link = &(*link)->rb_right;
+		else if (data_cmp < 0)
+			link = &(*link)->rb_left;
 		else
 			link = &(*link)->rb_right;
 	}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
 }
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
-					     __be64 service_id)
+					     __be64 service_id,
+					     u8 *private_data)
 {
 	struct rb_node *node = cm.listen_service_table.rb_node;
 	struct cm_id_private *cm_id_priv;
+	int data_cmp;
 	while (node) {
 		cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+		data_cmp = cm_compare_private_data(private_data,
+						   cm_id_priv->compare_data);
 		if ((cm_id_priv->id.service_mask & service_id) ==
 		     cm_id_priv->id.service_id &&
-		    (cm_id_priv->id.device == device))
+		    (cm_id_priv->id.device == device) && !data_cmp)
 			return cm_id_priv;
 		if (device < cm_id_priv->id.device)
@@ -405,6 +452,10 @@ static struct cm_id_private * cm_find_li
 			node = node->rb_right;
 		else if (service_id < cm_id_priv->id.service_id)
 			node = node->rb_left;
+		else if (service_id > cm_id_priv->id.service_id)
+			node = node->rb_right;
+		else if (data_cmp < 0)
+			node = node->rb_left;
 		else
 			node = node->rb_right;
 	}
@@ -728,15 +779,14 @@ retest:
 	wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount));
 	while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
 		cm_free_work(work);
-	if (cm_id_priv->private_data && cm_id_priv->private_data_len)
-		kfree(cm_id_priv->private_data);
+	kfree(cm_id_priv->compare_data);
+	kfree(cm_id_priv->private_data);
 	kfree(cm_id_priv);
 }
 EXPORT_SYMBOL(ib_destroy_cm_id);
-int ib_cm_listen(struct ib_cm_id *cm_id,
-		 __be64 service_id,
-		 __be64 service_mask)
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+		 struct ib_cm_compare_data *compare_data)
 {
 	struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
 	unsigned long flags;
@@ -750,7 +800,19 @@ int ib_cm_listen(struct ib_cm_id *cm_id,
 		return -EINVAL;
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
-	BUG_ON(cm_id->state != IB_CM_IDLE);
+	if (cm_id->state != IB_CM_IDLE)
+		return -EINVAL;
+
+	if (compare_data) {
+		cm_id_priv->compare_data = kzalloc(sizeof *compare_data,
+						   GFP_KERNEL);
+		if (!cm_id_priv->compare_data)
+			return -ENOMEM;
+		cm_mask_copy(cm_id_priv->compare_data->data,
+			     compare_data->data, compare_data->mask);
+		memcpy(cm_id_priv->compare_data->mask, compare_data->mask,
+		       IB_CM_COMPARE_SIZE);
+	}
 	cm_id->state = IB_CM_LISTEN;
@@ -767,6 +829,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id,
 	if (cur_cm_id_priv) {
 		cm_id->state = IB_CM_IDLE;
+		kfree(cm_id_priv->compare_data);
+		cm_id_priv->compare_data = NULL;
 		ret = -EBUSY;
 	}
 	return ret;
@@ -1239,7 +1303,8 @@ static struct cm_id_private * cm_match_r
 	/* Find matching listen request. */
 	listen_cm_id_priv = cm_find_listen(cm_id_priv->id.device,
-					   req_msg->service_id);
+					   req_msg->service_id,
+					   req_msg->private_data);
 	if (!listen_cm_id_priv) {
 		spin_unlock_irqrestore(&cm.lock, flags);
 		cm_issue_rej(work->port, work->mad_recv_wc,
@@ -2646,7 +2711,8 @@ static int cm_sidr_req_handler(struct cm
 		goto out; /* Duplicate message. */
 	}
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
-					sidr_req_msg->service_id);
+					sidr_req_msg->service_id,
+					sidr_req_msg->private_data);
 	if (!cur_cm_id_priv) {
 		rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table);
 		spin_unlock_irqrestore(&cm.lock, flags);
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c
--- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 16:03:08.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 16:03:35.000000000 -0800
@@ -646,6 +646,17 @@ out:
 	return result;
 }
+static int ucm_validate_listen(__be64 service_id, __be64 service_mask)
+{
+	service_id &= service_mask;
+
+	if (((service_id & IB_CMA_SERVICE_ID_MASK) == IB_CMA_SERVICE_ID) ||
+	    ((service_id & IB_SDP_SERVICE_ID_MASK) == IB_SDP_SERVICE_ID))
+		return -EINVAL;
+
+	return 0;
+}
+
 static ssize_t ib_ucm_listen(struct ib_ucm_file *file,
 			     const char __user *inbuf,
 			     int in_len, int out_len)
@@ -661,7 +672,13 @@ static ssize_t ib_ucm_listen(struct ib_u
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
-	result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask);
+	result = ucm_validate_listen(cmd.service_id, cmd.service_mask);
+	if (result)
+		goto out;
+
+	result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask,
+			      NULL);
+out:
 	ib_ucm_ctx_put(ctx);
 	return result;
 }
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_cm.h linux-2.6.ib/include/rdma/ib_cm.h
--- linux-2.6.git/include/rdma/ib_cm.h 2006-01-16 10:26:47.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_cm.h 2006-01-16 16:03:35.000000000 -0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ib_cm.h 2730 2005-06-28 16:43:03Z sean.hefty $
+ * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #if !defined(IB_CM_H)
 #define IB_CM_H
@@ -102,7 +102,8 @@ enum ib_cm_data_size {
 	IB_CM_APR_INFO_LENGTH = 72,
 	IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE = 216,
 	IB_CM_SIDR_REP_PRIVATE_DATA_SIZE = 136,
-	IB_CM_SIDR_REP_INFO_LENGTH = 72
+	IB_CM_SIDR_REP_INFO_LENGTH = 72,
+	IB_CM_COMPARE_SIZE = 64
 };
 struct ib_cm_id;
@@ -238,7 +239,6 @@ struct ib_cm_sidr_rep_event_param {
 	u32 qpn;
 	void *info;
 	u8 info_len;
-
 };
 struct ib_cm_event {
@@ -317,6 +317,15 @@ void ib_destroy_cm_id(struct ib_cm_id *c
 #define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL)
 #define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL)
+#define IB_CMA_SERVICE_ID __constant_cpu_to_be64(0x0000000001000000ULL)
+#define IB_CMA_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFF000000ULL)
+#define IB_SDP_SERVICE_ID __constant_cpu_to_be64(0x0000000000010000ULL)
+#define IB_SDP_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFFFF0000ULL)
+
+struct ib_cm_compare_data {
+	u8 data[IB_CM_COMPARE_SIZE];
+	u8 mask[IB_CM_COMPARE_SIZE];
+};
 /**
  * ib_cm_listen - Initiates listening on the specified service ID for
@@ -330,10 +339,12 @@ void ib_destroy_cm_id(struct ib_cm_id *c
  *   range of service IDs. If set to 0, the service ID is matched
  *   exactly. This parameter is ignored if %service_id is set to
  *   IB_CM_ASSIGN_SERVICE_ID.
+ * @compare_data: This parameter is optional. It specifies data that must
+ *   appear in the private data of a connection request for the specified
+ *   listen request.
  */
-int ib_cm_listen(struct ib_cm_id *cm_id,
-		 __be64 service_id,
-		 __be64 service_mask);
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+		 struct ib_cm_compare_data *compare_data);
 struct ib_cm_req_param {
 	struct ib_sa_path_rec *primary_path;

From sean.hefty at intel.com Wed Feb 1 12:15:08 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 1 Feb 2006 12:15:08 -0800
Subject: [openib-general] [PATCH 3/5] Infiniband: connection abstraction
In-Reply-To:
Message-ID:

The following patch provides an address translation service that maps IP
addresses to InfiniBand addresses (GIDs) using IPoIB. It also exports
ip_dev_find() so that a net_device can be located from an IP address.

Signed-off-by: Sean Hefty

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.000000000 -0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+	struct list_head list;
+	struct sockaddr src_addr;
+	struct sockaddr dst_addr;
+	struct rdma_dev_addr *addr;
+	void *context;
+	void (*callback)(int status, struct sockaddr *src_addr,
+			 struct rdma_dev_addr *addr, void *context);
+	unsigned long timeout;
+	int status;
+};
+
+static void process_req(void *data);
+
+static DEFINE_MUTEX(lock);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+		     unsigned char *dst_dev_addr)
+{
+	switch (dev->type) {
+	case ARPHRD_INFINIBAND:
+		dev_addr->dev_type = IB_NODE_CA;
+		break;
+	default:
+		return -EADDRNOTAVAIL;
+	}
+
+	memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+	memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+	if (dst_dev_addr)
+		memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+	return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+	struct net_device *dev;
+	u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+	int ret;
+
+	dev = ip_dev_find(ip);
+	if (!dev)
+		return -EADDRNOTAVAIL;
+
+	ret = copy_addr(dev_addr, dev, NULL);
+	dev_put(dev);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+	unsigned long delay;
+
+	cancel_delayed_work(&work);
+
+	delay = time - jiffies;
+	if ((long)delay <= 0)
+		delay = 1;
+
+	queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+	struct addr_req *temp_req;
+
+	mutex_lock(&lock);
+	list_for_each_entry_reverse(temp_req, &req_list, list) {
+		if (time_after(req->timeout, temp_req->timeout))
+			break;
+	}
+
+	list_add(&req->list, &temp_req->list);
+
+	if (req_list.next == &req->list)
+		set_timeout(req->timeout);
+	mutex_unlock(&lock);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+	struct rtable *rt;
+	struct flowi fl;
+	u32 dst_ip = dst_in->sin_addr.s_addr;
+
+	memset(&fl, 0, sizeof fl);
+	fl.nl_u.ip4_u.daddr = dst_ip;
+	if (ip_route_output_key(&rt, &fl))
+		return;
+
+	arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+		 rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
+	ip_rt_put(rt);
+}
+
+static int addr_resolve_remote(struct sockaddr_in *src_in,
+			       struct sockaddr_in *dst_in,
+			       struct rdma_dev_addr *addr)
+{
+	u32 src_ip = src_in->sin_addr.s_addr;
+	u32 dst_ip = dst_in->sin_addr.s_addr;
+	struct flowi fl;
+	struct rtable *rt;
+	struct neighbour *neigh;
+	int ret;
+
+	memset(&fl, 0, sizeof fl);
+	fl.nl_u.ip4_u.daddr = dst_ip;
+	fl.nl_u.ip4_u.saddr = src_ip;
+	ret = ip_route_output_key(&rt, &fl);
+	if (ret)
+		goto out;
+
+	neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev);
+	if (!neigh) {
+		ret = -ENODATA;
+		goto err1;
+	}
+
+	if (!(neigh->nud_state & NUD_VALID)) {
+		ret = -ENODATA;
+		goto err2;
+	}
+
+	if (!src_ip) {
+		src_in->sin_family = dst_in->sin_family;
+		src_in->sin_addr.s_addr = rt->rt_src;
+	}
+
+	ret = copy_addr(addr, neigh->dev, neigh->ha);
+err2:
+	neigh_release(neigh);
+err1:
+	ip_rt_put(rt);
+out:
+	return ret;
+}
+
+static void process_req(void *data)
+{
+	struct addr_req *req, *temp_req;
+	struct sockaddr_in *src_in, *dst_in;
+	struct list_head done_list;
+
+	INIT_LIST_HEAD(&done_list);
+
+	mutex_lock(&lock);
+	list_for_each_entry_safe(req, temp_req, &req_list, list) {
+		if (req->status) {
+			src_in = (struct sockaddr_in *) &req->src_addr;
+			dst_in = (struct sockaddr_in *) &req->dst_addr;
+			req->status = addr_resolve_remote(src_in, dst_in,
+							  req->addr);
+		}
+		if (req->status && time_after(jiffies, req->timeout))
+			req->status = -ETIMEDOUT;
+		else if (req->status == -ENODATA)
+			continue;
+
+		list_del(&req->list);
+		list_add_tail(&req->list, &done_list);
+	}
+
+	if (!list_empty(&req_list)) {
+		req = list_entry(req_list.next, struct addr_req, list);
+		set_timeout(req->timeout);
+	}
+	mutex_unlock(&lock);
+
+	list_for_each_entry_safe(req, temp_req, &done_list, list) {
+		list_del(&req->list);
+		req->callback(req->status, &req->src_addr, req->addr,
+			      req->context);
+		kfree(req);
+	}
+}
+
+static int addr_resolve_local(struct sockaddr_in *src_in,
+			      struct sockaddr_in *dst_in,
+			      struct rdma_dev_addr *addr)
+{
+	struct net_device *dev;
+	u32 src_ip = src_in->sin_addr.s_addr;
+	u32 dst_ip = dst_in->sin_addr.s_addr;
+	int ret;
+
+	dev = ip_dev_find(dst_ip);
+	if (!dev)
+		return -EADDRNOTAVAIL;
+
+	if (!src_ip) {
+		src_in->sin_family = dst_in->sin_family;
+		src_in->sin_addr.s_addr = dst_ip;
+		ret = copy_addr(addr, dev, dev->dev_addr);
+	} else {
+		ret = rdma_translate_ip((struct sockaddr *)src_in, addr);
+		if (!ret)
+			memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+	}
+
+	dev_put(dev);
+	return ret;
+}
+
+int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr,
+		    struct rdma_dev_addr *addr, int timeout_ms,
+		    void (*callback)(int status, struct sockaddr *src_addr,
+				     struct rdma_dev_addr *addr, void *context),
+		    void *context)
+{
+	struct sockaddr_in *src_in, *dst_in;
+	struct addr_req *req;
+	int ret = 0;
+
+	req = kmalloc(sizeof *req, GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+	memset(req, 0, sizeof *req);
+
+	if (src_addr)
+		memcpy(&req->src_addr, src_addr, ip_addr_size(src_addr));
+	memcpy(&req->dst_addr, dst_addr, ip_addr_size(dst_addr));
+	req->addr = addr;
+	req->callback = callback;
+	req->context = context;
+
src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = (struct sockaddr_in *) &req->dst_addr; + + req->status = addr_resolve_local(src_in, dst_in, addr); + if (req->status == -EADDRNOTAVAIL) + req->status = addr_resolve_remote(src_in, dst_in, addr); + + switch (req->status) { + case 0: + req->timeout = jiffies; + queue_req(req); + break; + case -ENODATA: + req->timeout = msecs_to_jiffies(timeout_ms) + jiffies; + queue_req(req); + addr_send_arp(dst_in); + break; + default: + ret = req->status; + kfree(req); + break; + } + return ret; +} +EXPORT_SYMBOL(rdma_resolve_ip); + +void rdma_addr_cancel(struct rdma_dev_addr *addr) +{ + struct addr_req *req, *temp_req; + + mutex_lock(&lock); + list_for_each_entry_safe(req, temp_req, &req_list, list) { + if (req->addr == addr) { + req->status = -ECANCELED; + req->timeout = jiffies; + list_del(&req->list); + list_add(&req->list, &req_list); + set_timeout(req->timeout); + break; + } + } + mutex_unlock(&lock); +} +EXPORT_SYMBOL(rdma_addr_cancel); + +static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pkt, struct net_device *orig_dev) +{ + struct arphdr *arp_hdr; + + arp_hdr = (struct arphdr *) skb->nh.raw; + + if (dev->type == ARPHRD_INFINIBAND && + (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || + arp_hdr->ar_op == __constant_htons(ARPOP_REPLY))) + set_timeout(jiffies); + + kfree_skb(skb); + return 0; +} + +static struct packet_type addr_arp = { + .type = __constant_htons(ETH_P_ARP), + .func = addr_arp_recv, + .af_packet_priv = (void*) 1, +}; + +static int addr_init(void) +{ + rdma_wq = create_singlethread_workqueue("rdma_wq"); + if (!rdma_wq) + return -ENOMEM; + + dev_add_pack(&addr_arp); + return 0; +} + +static void addr_cleanup(void) +{ + dev_remove_pack(&addr_arp); + destroy_workqueue(rdma_wq); +} + +module_init(addr_init); +module_exit(addr_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:03:08.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:14:24.000000000 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o + ib_cm.o ib_addr.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -12,6 +12,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +ib_addr-y := addr.o + ib_umad-y := user_mad.o ib_ucm-y := ucm.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/ib_addr.h linux-2.6.ib/include/rdma/ib_addr.h --- linux-2.6.git/include/rdma/ib_addr.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/ib_addr.h 2006-01-16 16:14:24.000000000 -0800 @@ -0,0 +1,117 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(IB_ADDR_H) +#define IB_ADDR_H + +#include +#include +#include +#include +#include + +extern struct workqueue_struct *rdma_wq; + +struct rdma_dev_addr { + unsigned char src_dev_addr[MAX_ADDR_LEN]; + unsigned char dst_dev_addr[MAX_ADDR_LEN]; + unsigned char broadcast[MAX_ADDR_LEN]; + enum ib_node_type dev_type; +}; + +/** + * rdma_translate_ip - Translate a local IP address to an RDMA hardware + * address. + */ +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr); + +/** + * rdma_resolve_ip - Resolve source and destination IP addresses to + * RDMA hardware addresses. + * @src_addr: An optional source address to use in the resolution. If a + * source address is not provided, a usable address will be returned via + * the callback. + * @dst_addr: The destination address to resolve. + * @addr: A reference to a data location that will receive the resolved + * addresses. The data location must remain valid until the callback has + * been invoked. + * @timeout_ms: Amount of time to wait for the address resolution to complete. + * @callback: Call invoked once address resolution has completed, timed out, + * or been canceled. A status of 0 indicates success. + * @context: User-specified context associated with the call. + */ +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *addr, void *context), + void *context); + +void rdma_addr_cancel(struct rdma_dev_addr *addr); + +static inline int ip_addr_size(struct sockaddr *addr) +{ + return addr->sa_family == AF_INET6 ? 
+ sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); +} + +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) +{ + return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9]; +} + +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey) +{ + dev_addr->broadcast[8] = pkey >> 8; + dev_addr->broadcast[9] = (unsigned char) pkey; +} + +static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->src_dev_addr + 4); +} + +static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); +} + +static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->dst_dev_addr + 4); +} + +static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); +} + +#endif /* IB_ADDR_H */ + diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/net/ipv4/fib_frontend.c linux-2.6.ib/net/ipv4/fib_frontend.c --- linux-2.6.git/net/ipv4/fib_frontend.c 2006-01-16 10:28:29.000000000 -0800 +++ linux-2.6.ib/net/ipv4/fib_frontend.c 2006-01-16 16:14:24.000000000 -0800 @@ -666,4 +666,5 @@ void __init ip_fib_init(void) } EXPORT_SYMBOL(inet_addr_type); +EXPORT_SYMBOL(ip_dev_find); EXPORT_SYMBOL(ip_rt_ioctl); From sean.hefty at intel.com Wed Feb 1 12:18:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 12:18:07 -0800 Subject: [openib-general] [PATCH 4/5] Infiniband: connection abstraction In-Reply-To: Message-ID: The following patch implements a kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. It also handles RDMA device hotplug events on behalf of clients. 
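As a rough illustration of the connection abstraction described above, the state guards in cma.c (cma_comp_exch and cma_exch in the patch below) can be modelled in user space. This is only a sketch under stated assumptions: struct demo_id and the demo_* helpers are hypothetical names that are not part of the patch, and a pthread mutex stands in for the kernel spinlock. Only the state names and the ADDR_BOUND -> LISTEN check are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* State names copied from enum cma_state in cma.c in this patch. */
enum cma_state {
	CMA_IDLE,
	CMA_ADDR_QUERY,
	CMA_ADDR_RESOLVED,
	CMA_ROUTE_QUERY,
	CMA_ROUTE_RESOLVED,
	CMA_CONNECT,
	CMA_ADDR_BOUND,
	CMA_LISTEN,
	CMA_DEVICE_REMOVAL,
	CMA_DESTROYING
};

/* Hypothetical user-space stand-in for struct rdma_id_private. */
struct demo_id {
	enum cma_state state;
	pthread_mutex_t lock;	/* assumption: replaces the kernel spinlock */
};

static void demo_id_init(struct demo_id *id)
{
	id->state = CMA_IDLE;
	pthread_mutex_init(&id->lock, NULL);
}

/* Move to 'exch' only if the current state is 'comp'; returns 1 on success. */
static int demo_comp_exch(struct demo_id *id, enum cma_state comp,
			  enum cma_state exch)
{
	int ret;

	pthread_mutex_lock(&id->lock);
	ret = (id->state == comp);
	if (ret)
		id->state = exch;
	pthread_mutex_unlock(&id->lock);
	return ret;
}

/* Unconditionally set the state and return the previous one. */
static enum cma_state demo_exch(struct demo_id *id, enum cma_state exch)
{
	enum cma_state old;

	pthread_mutex_lock(&id->lock);
	old = id->state;
	id->state = exch;
	pthread_mutex_unlock(&id->lock);
	return old;
}

/* Mirrors the guard at the top of rdma_listen(): only a bound ID may listen. */
static int demo_listen(struct demo_id *id)
{
	if (!demo_comp_exch(id, CMA_ADDR_BOUND, CMA_LISTEN))
		return -EINVAL;
	return 0;
}
```

The point of the compare-and-exchange helper is that a caller racing with destruction or device removal sees the transition fail instead of acting on a stale state; the unconditional exchange returns the old state so the destroy path can tell which pending operation to cancel.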
Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.000000000 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add = cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DEFINE_MUTEX(lock); + +struct cma_device { + struct list_head list; + struct ib_device *device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_t refcount; + struct list_head id_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. 
+ */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_head list; + struct list_head listen_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + wait_queue_head_t wait_remove; + atomic_t dev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +#define CMA_VERSION 0x00 +#define SDP_VERSION 0x22 + +static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + ret = (id_priv->state == comp); + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static int cma_comp_exch(struct rdma_id_private *id_priv, + enum cma_state comp, enum cma_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + if ((ret = (id_priv->state == comp))) + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static enum cma_state cma_exch(struct rdma_id_private *id_priv, + enum cma_state exch) +{ + unsigned long flags; + enum cma_state old; + + spin_lock_irqsave(&id_priv->lock, flags); + old = id_priv->state; + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return old; +} + +static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) +{ 
+ return hdr->ip_version >> 4; +} + +static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) +{ + hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF); +} + +static inline u8 sdp_get_ip_ver(struct sdp_hh *hh) +{ + return hh->ip_version >> 4; +} + +static inline void sdp_set_ip_ver(struct sdp_hh *hh, u8 ip_ver) +{ + hh->ip_version = (ip_ver << 4) | (hh->ip_version & 0xF); +} + +static void cma_attach_to_dev(struct rdma_id_private *id_priv, + struct cma_device *cma_dev) +{ + atomic_inc(&cma_dev->refcount); + id_priv->cma_dev = cma_dev; + id_priv->id.device = cma_dev->device; + list_add_tail(&id_priv->list, &cma_dev->id_list); +} + +static void cma_detach_from_dev(struct rdma_id_private *id_priv) +{ + list_del(&id_priv->list); + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) + wake_up(&id_priv->cma_dev->wait); + id_priv->cma_dev = NULL; +} + +static int cma_acquire_ib_dev(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + union ib_gid *gid; + int ret = -ENODEV; + + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) { + ret = ib_find_cached_gid(cma_dev->device, gid, + &id_priv->id.port_num, NULL); + if (!ret) { + cma_attach_to_dev(id_priv, cma_dev); + break; + } + } + mutex_unlock(&lock); + return ret; +} + +static int cma_acquire_dev(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.route.addr.dev_addr.dev_type) { + case IB_NODE_CA: + return cma_acquire_ib_dev(id_priv); + default: + return -ENODEV; + } +} + +static void cma_deref_id(struct rdma_id_private *id_priv) +{ + if (atomic_dec_and_test(&id_priv->refcount)) + wake_up(&id_priv->wait); +} + +static void cma_release_remove(struct rdma_id_private *id_priv) +{ + if (atomic_dec_and_test(&id_priv->dev_remove)) + wake_up(&id_priv->wait_remove); +} + +struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps) +{ + struct rdma_id_private 
*id_priv; + + id_priv = kzalloc(sizeof *id_priv, GFP_KERNEL); + if (!id_priv) + return ERR_PTR(-ENOMEM); + + id_priv->state = CMA_IDLE; + id_priv->id.context = context; + id_priv->id.event_handler = event_handler; + id_priv->id.ps = ps; + spin_lock_init(&id_priv->lock); + init_waitqueue_head(&id_priv->wait); + atomic_set(&id_priv->refcount, 1); + init_waitqueue_head(&id_priv->wait_remove); + atomic_set(&id_priv->dev_remove, 0); + INIT_LIST_HEAD(&id_priv->listen_list); + get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); + + return &id_priv->id; +} +EXPORT_SYMBOL(rdma_create_id); + +static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + struct rdma_dev_addr *dev_addr; + int ret; + + dev_addr = &id_priv->id.route.addr.dev_addr; + ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, + ib_addr_get_pkey(dev_addr), + &qp_attr.pkey_index); + if (ret) + return ret; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr.port_num = id_priv->id.port_num; + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | IB_QP_PORT); +} + +int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct rdma_id_private *id_priv; + struct ib_qp *qp; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (id->device != pd->device) + return -EINVAL; + + qp = ib_create_qp(pd, qp_init_attr); + if (IS_ERR(qp)) + return PTR_ERR(qp); + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_init_ib_qp(id_priv, qp); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) + goto err; + + id->qp = qp; + id_priv->qp_num = qp->qp_num; + id_priv->qp_type = qp->qp_type; + id_priv->srq = (qp->srq != NULL); + return 0; +err: + ib_destroy_qp(qp); + return ret; +} +EXPORT_SYMBOL(rdma_create_qp); + +void rdma_destroy_qp(struct rdma_cm_id *id) +{ + 
ib_destroy_qp(id->qp); +} +EXPORT_SYMBOL(rdma_destroy_qp); + +static int cma_modify_qp_rtr(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + if (!id->qp) + return 0; + + /* Need to update QP attributes from default values. */ + qp_attr.qp_state = IB_QPS_INIT; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); + if (ret) + return ret; + + qp_attr.qp_state = IB_QPS_RTR; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_qp_rts(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IB_QPS_RTS; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_qp_err(struct rdma_cm_id *id) +{ + struct ib_qp_attr qp_attr; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); +} + +int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + qp_attr_mask); + if (qp_attr->qp_state == IB_QPS_RTR) + qp_attr->rq_psn = id_priv->seq_num; + break; + default: + ret = -ENOSYS; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_init_qp_attr); + +static inline int cma_any_addr(struct sockaddr *addr) +{ + struct in6_addr *ip6; + + if (addr->sa_family == AF_INET) + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + INADDR_ANY; + else { + ip6 = &((struct sockaddr_in6 *) addr)->sin6_addr; + return (ip6->s6_addr32[0] | ip6->s6_addr32[1] | + ip6->s6_addr32[3] 
| ip6->s6_addr32[2]) == 0; + } +} + +static inline int cma_loopback_addr(struct sockaddr *addr) +{ + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + htonl(INADDR_LOOPBACK); +} + +static int cma_get_net_info(void *hdr, enum rdma_port_space ps, + u8 *ip_ver, __u16 *port, + union cma_ip_addr **src, union cma_ip_addr **dst) +{ + switch (ps) { + case RDMA_PS_SDP: + if (((struct sdp_hh *) hdr)->sdp_version != SDP_VERSION) + return -EINVAL; + + *ip_ver = sdp_get_ip_ver(hdr); + *port = ((struct sdp_hh *) hdr)->port; + *src = &((struct sdp_hh *) hdr)->src_addr; + *dst = &((struct sdp_hh *) hdr)->dst_addr; + break; + default: + if (((struct cma_hdr *) hdr)->cma_version != CMA_VERSION) + return -EINVAL; + + *ip_ver = cma_get_ip_ver(hdr); + *port = ((struct cma_hdr *) hdr)->port; + *src = &((struct cma_hdr *) hdr)->src_addr; + *dst = &((struct cma_hdr *) hdr)->dst_addr; + break; + } + return 0; +} + +static void cma_save_net_info(struct rdma_addr *addr, + struct rdma_addr *listen_addr, + u8 ip_ver, __u16 port, + union cma_ip_addr *src, union cma_ip_addr *dst) +{ + struct sockaddr_in *listen4, *ip4; + struct sockaddr_in6 *listen6, *ip6; + + switch (ip_ver) { + case 4: + listen4 = (struct sockaddr_in *) &listen_addr->src_addr; + ip4 = (struct sockaddr_in *) &addr->src_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = dst->ip4.addr; + ip4->sin_port = listen4->sin_port; + + ip4 = (struct sockaddr_in *) &addr->dst_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = src->ip4.addr; + ip4->sin_port = port; + break; + case 6: + listen6 = (struct sockaddr_in6 *) &listen_addr->src_addr; + ip6 = (struct sockaddr_in6 *) &addr->src_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = dst->ip6; + ip6->sin6_port = listen6->sin6_port; + + ip6 = (struct sockaddr_in6 *) &addr->dst_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = src->ip6; + ip6->sin6_port = port; + break; + default: + break; + } +} + +static
inline int cma_user_data_offset(enum rdma_port_space ps) +{ + switch (ps) { + case RDMA_PS_SDP: + return 0; + default: + return sizeof(struct cma_hdr); + } +} + +static int cma_notify_user(struct rdma_id_private *id_priv, + enum rdma_cm_event_type type, int status, + void *data, u8 data_len) +{ + struct rdma_cm_event event; + + event.event = type; + event.status = status; + event.private_data = data; + event.private_data_len = data_len; + + return id_priv->id.event_handler(&id_priv->id, &event); +} + +static void cma_cancel_addr(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + rdma_addr_cancel(&id_priv->id.route.addr.dev_addr); + break; + default: + break; + } +} + +static void cma_cancel_route(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ib_sa_cancel_query(id_priv->query_id, id_priv->query); + break; + default: + break; + } +} + +static inline int cma_internal_listen(struct rdma_id_private *id_priv) +{ + return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev && + cma_any_addr(&id_priv->id.route.addr.src_addr); +} + +static void cma_destroy_listen(struct rdma_id_private *id_priv) +{ + cma_exch(id_priv, CMA_DESTROYING); + + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) + ib_destroy_cm_id(id_priv->cm_id); + + list_del(&id_priv->listen_list); + if (id_priv->cma_dev) + cma_detach_from_dev(id_priv); + + atomic_dec(&id_priv->refcount); + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); + + kfree(id_priv); +} + +static void cma_cancel_listens(struct rdma_id_private *id_priv) +{ + struct rdma_id_private *dev_id_priv; + + mutex_lock(&lock); + list_del(&id_priv->list); + + while (!list_empty(&id_priv->listen_list)) { + dev_id_priv = list_entry(id_priv->listen_list.next, + struct rdma_id_private, listen_list); + cma_destroy_listen(dev_id_priv); + } + mutex_unlock(&lock); +} + +static void cma_cancel_operation(struct rdma_id_private *id_priv, + enum 
cma_state state) +{ + switch (state) { + case CMA_ADDR_QUERY: + cma_cancel_addr(id_priv); + break; + case CMA_ROUTE_QUERY: + cma_cancel_route(id_priv); + break; + case CMA_LISTEN: + if (cma_any_addr(&id_priv->id.route.addr.src_addr) && + !id_priv->cma_dev) + cma_cancel_listens(id_priv); + break; + default: + break; + } +} + +void rdma_destroy_id(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + enum cma_state state; + + id_priv = container_of(id, struct rdma_id_private, id); + state = cma_exch(id_priv, CMA_DESTROYING); + cma_cancel_operation(id_priv, state); + + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) + ib_destroy_cm_id(id_priv->cm_id); + + if (id_priv->cma_dev) { + mutex_lock(&lock); + cma_detach_from_dev(id_priv); + mutex_unlock(&lock); + } + + atomic_dec(&id_priv->refcount); + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); + + kfree(id_priv->id.route.path_rec); + kfree(id_priv); +} +EXPORT_SYMBOL(rdma_destroy_id); + +static int cma_rep_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + goto reject; + + ret = cma_modify_qp_rts(&id_priv->id); + if (ret) + goto reject; + + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_rtu_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_qp_rts(&id_priv->id); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv = cm_id->context; + enum rdma_cm_event_type event; + u8 private_data_len = 0; + int ret = 0, status = 0; + + if (!cma_comp(id_priv, CMA_CONNECT)) + return 0; + + 
atomic_inc(&id_priv->dev_remove); + switch (ib_event->event) { + case IB_CM_REQ_ERROR: + case IB_CM_REP_ERROR: + event = RDMA_CM_EVENT_UNREACHABLE; + status = -ETIMEDOUT; + break; + case IB_CM_REP_RECEIVED: + if (id_priv->id.qp) { + status = cma_rep_recv(id_priv); + event = status ? RDMA_CM_EVENT_CONNECT_ERROR : + RDMA_CM_EVENT_ESTABLISHED; + } else + event = RDMA_CM_EVENT_CONNECT_RESPONSE; + private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + break; + case IB_CM_RTU_RECEIVED: + status = cma_rtu_recv(id_priv); + event = status ? RDMA_CM_EVENT_CONNECT_ERROR : + RDMA_CM_EVENT_ESTABLISHED; + break; + case IB_CM_DREQ_ERROR: + status = -ETIMEDOUT; /* fall through */ + case IB_CM_DREQ_RECEIVED: + case IB_CM_DREP_RECEIVED: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IB_CM_TIMEWAIT_EXIT: + case IB_CM_MRA_RECEIVED: + /* ignore event */ + goto out; + case IB_CM_REJ_RECEIVED: + cma_modify_qp_err(&id_priv->id); + status = ib_event->param.rej_rcvd.reason; + event = RDMA_CM_EVENT_REJECTED; + break; + default: + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", + ib_event->event); + goto out; + } + + ret = cma_notify_user(id_priv, event, status, ib_event->private_data, + private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. */ + id_priv->cm_id = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } +out: + cma_release_remove(id_priv); + return ret; +} + +static struct rdma_id_private* cma_new_id(struct rdma_cm_id *listen_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv; + struct rdma_cm_id *id; + struct rdma_route *rt; + union cma_ip_addr *src, *dst; + __u16 port; + u8 ip_ver; + + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); + if (IS_ERR(id)) + return NULL; + + rt = &id->route; + rt->num_paths = ib_event->param.req_rcvd.alternate_path ? 
2 : 1; + rt->path_rec = kmalloc(sizeof *rt->path_rec * rt->num_paths, GFP_KERNEL); + if (!rt->path_rec) + goto err; + + if (cma_get_net_info(ib_event->private_data, listen_id->ps, + &ip_ver, &port, &src, &dst)) + goto err; + + cma_save_net_info(&id->route.addr, &listen_id->route.addr, + ip_ver, port, src, dst); + rt->path_rec[0] = *ib_event->param.req_rcvd.primary_path; + if (rt->num_paths == 2) + rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; + + ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); + ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); + ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); + rt->addr.dev_addr.dev_type = IB_NODE_CA; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->state = CMA_CONNECT; + return id_priv; +err: + rdma_destroy_id(id); + return NULL; +} + +static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *listen_id, *conn_id; + int offset, ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + conn_id = cma_new_id(&listen_id->id, ib_event); + if (!conn_id) { + ret = -ENOMEM; + goto out; + } + + atomic_inc(&conn_id->dev_remove); + ret = cma_acquire_ib_dev(conn_id); + if (ret) { + ret = -ENODEV; + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + goto out; + } + + conn_id->cm_id = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_ib_handler; + + offset = cma_user_data_offset(listen_id->id.ps); + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + ib_event->private_data + offset, + IB_CM_REQ_PRIVATE_DATA_SIZE - offset); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } +out: + cma_release_remove(listen_id); + return ret; +} + +static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) +{ + return cpu_to_be64(((u64)ps << 16) + + be16_to_cpu(((struct sockaddr_in *) addr)->sin_port)); +} + +static void cma_set_compare_data(struct sockaddr *addr, + struct ib_cm_compare_data *compare) +{ + struct cma_hdr *data, *mask; + + memset(compare, 0, sizeof *compare); + data = (void *) compare->data; + mask = (void *) compare->mask; + + switch (addr->sa_family) { + case AF_INET: + cma_set_ip_ver(data, 4); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip4.addr = ((struct sockaddr_in *) addr)-> + sin_addr.s_addr; + mask->dst_addr.ip4.addr = ~0; + break; + case AF_INET6: + cma_set_ip_ver(data, 6); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip6 = ((struct sockaddr_in6 *) addr)-> + sin6_addr; + memset(&mask->dst_addr.ip6, 0xFF, sizeof mask->dst_addr.ip6); + break; + default: + break; + } +} + +static int cma_ib_listen(struct rdma_id_private *id_priv) +{ + struct ib_cm_compare_data compare_data; + struct sockaddr *addr; + __be64 svc_id; + int ret; + + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id)) + return PTR_ERR(id_priv->cm_id); + + addr = &id_priv->id.route.addr.src_addr; + svc_id = cma_get_service_id(id_priv->id.ps, addr); + if (cma_any_addr(addr)) + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); + else { + cma_set_compare_data(addr, &compare_data); + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, &compare_data); + } + + if (ret) { + ib_destroy_cm_id(id_priv->cm_id); + id_priv->cm_id = NULL; + } + + return ret; +} + +static int cma_duplicate_listen(struct rdma_id_private *id_priv) +{ + struct rdma_id_private *cur_id_priv; + struct sockaddr_in *cur_addr, *new_addr; + + new_addr = (struct sockaddr_in *)
&id_priv->id.route.addr.src_addr; + list_for_each_entry(cur_id_priv, &listen_any_list, listen_list) { + cur_addr = (struct sockaddr_in *) + &cur_id_priv->id.route.addr.src_addr; + if (cur_addr->sin_port == new_addr->sin_port) + return -EADDRINUSE; + } + return 0; +} + +static int cma_listen_handler(struct rdma_cm_id *id, + struct rdma_cm_event *event) +{ + struct rdma_id_private *id_priv = id->context; + + id->context = id_priv->id.context; + id->event_handler = id_priv->id.event_handler; + return id_priv->id.event_handler(id, event); +} + +static void cma_listen_on_dev(struct rdma_id_private *id_priv, + struct cma_device *cma_dev) +{ + struct rdma_id_private *dev_id_priv; + struct rdma_cm_id *id; + int ret; + + id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); + if (IS_ERR(id)) + return; + + dev_id_priv = container_of(id, struct rdma_id_private, id); + ret = rdma_bind_addr(id, &id_priv->id.route.addr.src_addr); + if (ret) + goto err; + + cma_attach_to_dev(dev_id_priv, cma_dev); + list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); + + ret = rdma_listen(id, id_priv->backlog); + if (ret) + goto err; + + return; +err: + cma_destroy_listen(dev_id_priv); +} + +static int cma_listen_on_all(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + int ret; + + mutex_lock(&lock); + ret = cma_duplicate_listen(id_priv); + if (ret) + goto out; + + list_add_tail(&id_priv->list, &listen_any_list); + list_for_each_entry(cma_dev, &dev_list, list) + cma_listen_on_dev(id_priv, cma_dev); +out: + mutex_unlock(&lock); + return ret; +} + +int rdma_listen(struct rdma_cm_id *id, int backlog) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) + return -EINVAL; + + if (id->device) { + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_ib_listen(id_priv); + break; + default: + ret = -ENOSYS; + break; + } + } else + ret = 
cma_listen_on_all(id_priv); + + if (ret) + goto err; + + id_priv->backlog = backlog; + return 0; +err: + cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); + return ret; +}; +EXPORT_SYMBOL(rdma_listen); + +static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, + void *context) +{ + struct rdma_id_private *id_priv = context; + struct rdma_route *route = &id_priv->id.route; + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + + atomic_inc(&id_priv->dev_remove); + if (!status) { + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); + if (route->path_rec) { + route->num_paths = 1; + *route->path_rec = *path_rec; + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED)) { + kfree(route->path_rec); + goto out; + } + } else + status = -ENOMEM; + } + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) + goto out; + event = RDMA_CM_EVENT_ROUTE_ERROR; + } + + if (cma_notify_user(id_priv, event, status, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + struct ib_sa_path_rec path_rec; + + memset(&path_rec, 0, sizeof path_rec); + path_rec.sgid = *ib_addr_get_sgid(addr); + path_rec.dgid = *ib_addr_get_dgid(addr); + path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); + path_rec.numb_path = 1; + + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + id_priv->id.port_num, &path_rec, + IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, + timeout_ms, GFP_KERNEL, + cma_query_handler, id_priv, &id_priv->query); + + return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; +} + +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY)) + return -EINVAL; + + atomic_inc(&id_priv->refcount); + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_resolve_ib_route(id_priv, timeout_ms); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED); + cma_deref_id(id_priv); + return ret; +} +EXPORT_SYMBOL(rdma_resolve_route); + +static int cma_bind_loopback(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + union ib_gid *gid; + u16 pkey; + int ret; + + mutex_lock(&lock); + if (list_empty(&dev_list)) { + ret = -ENODEV; + goto out; + } + + cma_dev = list_entry(dev_list.next, struct cma_device, list); + gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); + ret = ib_get_cached_gid(cma_dev->device, 1, 0, gid); + if (ret) + goto out; + + ret = ib_get_cached_pkey(cma_dev->device, 1, 0, &pkey); + if (ret) + goto out; + + ib_addr_set_pkey(&id_priv->id.route.addr.dev_addr, pkey); + id_priv->id.port_num = 1; + cma_attach_to_dev(id_priv, cma_dev); +out: + mutex_unlock(&lock); + return ret; +} + +static void addr_handler(int status, struct sockaddr *src_addr, + struct rdma_dev_addr *dev_addr, void *context) +{ + struct rdma_id_private *id_priv = context; + enum rdma_cm_event_type event; + enum cma_state old_state; + + atomic_inc(&id_priv->dev_remove); + if (!id_priv->cma_dev) { + old_state = CMA_IDLE; + if (!status) + status = cma_acquire_dev(id_priv); + } else + old_state = CMA_ADDR_BOUND; + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, old_state)) + goto out; + event = RDMA_CM_EVENT_ADDR_ERROR; + } else { + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + 
memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); + event = RDMA_CM_EVENT_ADDR_RESOLVED; + } + + if (cma_notify_user(id_priv, event, status, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static void loopback_addr_handler(void *data) +{ + struct cma_work *work = data; + struct rdma_id_private *id_priv = work->id; + + kfree(work); + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_loopback(struct rdma_id_private *id_priv, + struct sockaddr *src_addr, enum cma_state state) +{ + struct cma_work *work; + struct rdma_dev_addr *dev_addr; + int ret; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + if (state == CMA_IDLE) { + ret = cma_bind_loopback(id_priv); + if (ret) + goto err; + dev_addr = &id_priv->id.route.addr.dev_addr; + ib_addr_set_dgid(dev_addr, ib_addr_get_sgid(dev_addr)); + if (!src_addr || cma_any_addr(src_addr)) + src_addr = &id_priv->id.route.addr.dst_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); + } + + work->id = id_priv; + INIT_WORK(&work->work, loopback_addr_handler, work); + queue_work(rdma_wq, &work->work); + return 0; +err: + kfree(work); + return ret; +} + +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms) +{ + struct rdma_id_private *id_priv; + enum cma_state expected_state; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if 
(id_priv->cma_dev) { + expected_state = CMA_ADDR_BOUND; + src_addr = &id->route.addr.src_addr; + } else + expected_state = CMA_IDLE; + + if (!cma_comp_exch(id_priv, expected_state, CMA_ADDR_QUERY)) + return -EINVAL; + + atomic_inc(&id_priv->refcount); + memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); + if (cma_loopback_addr(dst_addr)) + ret = cma_resolve_loopback(id_priv, src_addr, expected_state); + else + ret = rdma_resolve_ip(src_addr, dst_addr, + &id->route.addr.dev_addr, + timeout_ms, addr_handler, id_priv); + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_ADDR_QUERY, expected_state); + cma_deref_id(id_priv); + return ret; +} +EXPORT_SYMBOL(rdma_resolve_addr); + +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct rdma_id_private *id_priv; + struct rdma_dev_addr *dev_addr; + int ret; + + if (addr->sa_family != AF_INET) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) + return -EINVAL; + + if (cma_any_addr(addr)) { + ret = 0; + } else if (cma_loopback_addr(addr)) { + ret = cma_bind_loopback(id_priv); + } else { + dev_addr = &id->route.addr.dev_addr; + ret = rdma_translate_ip(addr, dev_addr); + if (!ret) + ret = cma_acquire_dev(id_priv); + } + + if (ret) + goto err; + + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); + return 0; +err: + cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); + return ret; +} +EXPORT_SYMBOL(rdma_bind_addr); + +static void cma_format_hdr(void *hdr, enum rdma_port_space ps, + struct rdma_route *route) +{ + struct sockaddr_in *src4, *dst4; + struct cma_hdr *cma_hdr; + struct sdp_hh *sdp_hdr; + + src4 = (struct sockaddr_in *) &route->addr.src_addr; + dst4 = (struct sockaddr_in *) &route->addr.dst_addr; + + switch (ps) { + case RDMA_PS_SDP: + sdp_hdr = hdr; + sdp_hdr->sdp_version = SDP_VERSION; + sdp_set_ip_ver(sdp_hdr, 4); + sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; 
+ sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + sdp_hdr->port = src4->sin_port; + break; + default: + cma_hdr = hdr; + cma_hdr->cma_version = CMA_VERSION; + cma_set_ip_ver(cma_hdr, 4); + cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + cma_hdr->port = src4->sin_port; + break; + } +} + +static int cma_connect_ib(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct ib_cm_req_param req; + struct rdma_route *route; + void *private_data; + int offset, ret; + + memset(&req, 0, sizeof req); + offset = cma_user_data_offset(id_priv->id.ps); + req.private_data_len = offset + conn_param->private_data_len; + private_data = kzalloc(req.private_data_len, GFP_ATOMIC); + if (!private_data) + return -ENOMEM; + + if (conn_param->private_data && conn_param->private_data_len) + memcpy(private_data + offset, conn_param->private_data, + conn_param->private_data_len); + + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv); + if (IS_ERR(id_priv->cm_id)) { + ret = PTR_ERR(id_priv->cm_id); + goto out; + } + + route = &id_priv->id.route; + cma_format_hdr(private_data, id_priv->id.ps, route); + req.private_data = private_data; + + req.primary_path = &route->path_rec[0]; + if (route->num_paths == 2) + req.alternate_path = &route->path_rec[1]; + + req.service_id = cma_get_service_id(id_priv->id.ps, + &route->addr.dst_addr); + req.qp_num = id_priv->qp_num; + req.qp_type = id_priv->qp_type; + req.starting_psn = id_priv->seq_num; + req.responder_resources = conn_param->responder_resources; + req.initiator_depth = conn_param->initiator_depth; + req.flow_control = conn_param->flow_control; + req.retry_count = conn_param->retry_count; + req.rnr_retry_count = conn_param->rnr_retry_count; + req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; + req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; + req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.srq = id_priv->srq ? 
1 : 0; + + ret = ib_send_cm_req(id_priv->cm_id, &req); +out: + kfree(private_data); + return ret; +} + +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) + return -EINVAL; + + if (!id->qp) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_connect_ib(id_priv, conn_param); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED); + return ret; +} +EXPORT_SYMBOL(rdma_connect); + +static int cma_accept_ib(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct ib_cm_rep_param rep; + int ret; + + ret = cma_modify_qp_rtr(&id_priv->id); + if (ret) + return ret; + + memset(&rep, 0, sizeof rep); + rep.qp_num = id_priv->qp_num; + rep.starting_psn = id_priv->seq_num; + rep.private_data = conn_param->private_data; + rep.private_data_len = conn_param->private_data_len; + rep.responder_resources = conn_param->responder_resources; + rep.initiator_depth = conn_param->initiator_depth; + rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; + rep.failover_accepted = 0; + rep.flow_control = conn_param->flow_control; + rep.rnr_retry_count = conn_param->rnr_retry_count; + rep.srq = id_priv->srq ? 
1 : 0; + + return ib_send_cm_rep(id_priv->cm_id, &rep); +} + +int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + if (!id->qp && conn_param) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + + switch (id->device->node_type) { + case IB_NODE_CA: + if (conn_param) + ret = cma_accept_ib(id_priv, conn_param); + else + ret = cma_rep_recv(id_priv); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(id); + rdma_reject(id, NULL, 0); + return ret; +} +EXPORT_SYMBOL(rdma_accept); + +int rdma_reject(struct rdma_cm_id *id, const void *private_data, + u8 private_data_len) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, private_data, private_data_len); + break; + default: + ret = -ENOSYS; + break; + } + return ret; +}; +EXPORT_SYMBOL(rdma_reject); + +int rdma_disconnect(struct rdma_cm_id *id) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp(id_priv, CMA_CONNECT)) + return -EINVAL; + + ret = cma_modify_qp_err(id); + if (ret) + goto out; + + switch (id->device->node_type) { + case IB_NODE_CA: + /* Initiate or respond to a disconnect. 
*/ + if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id, NULL, 0); + break; + default: + break; + } +out: + return ret; +} +EXPORT_SYMBOL(rdma_disconnect); + +static void cma_add_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + + cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL); + if (!cma_dev) + return; + + cma_dev->device = device; + cma_dev->node_guid = device->node_guid; + if (!cma_dev->node_guid) + goto err; + + init_waitqueue_head(&cma_dev->wait); + atomic_set(&cma_dev->refcount, 1); + INIT_LIST_HEAD(&cma_dev->id_list); + ib_set_client_data(device, &cma_client, cma_dev); + + mutex_lock(&lock); + list_add_tail(&cma_dev->list, &dev_list); + list_for_each_entry(id_priv, &listen_any_list, list) + cma_listen_on_dev(id_priv, cma_dev); + mutex_unlock(&lock); + return; +err: + kfree(cma_dev); +} + +static int cma_remove_id_dev(struct rdma_id_private *id_priv) +{ + enum cma_state state; + + /* Record that we want to remove the device */ + state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); + if (state == CMA_DESTROYING) + return 0; + + cma_cancel_operation(id_priv, state); + wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + + /* Check for destruction from another callback. 
*/ + if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) + return 0; + + return cma_notify_user(id_priv, RDMA_CM_EVENT_DEVICE_REMOVAL, + 0, NULL, 0); +} + +static void cma_process_remove(struct cma_device *cma_dev) +{ + struct list_head remove_list; + struct rdma_id_private *id_priv; + int ret; + + INIT_LIST_HEAD(&remove_list); + + mutex_lock(&lock); + while (!list_empty(&cma_dev->id_list)) { + id_priv = list_entry(cma_dev->id_list.next, + struct rdma_id_private, list); + + if (cma_internal_listen(id_priv)) { + cma_destroy_listen(id_priv); + continue; + } + + list_del(&id_priv->list); + list_add_tail(&id_priv->list, &remove_list); + atomic_inc(&id_priv->refcount); + mutex_unlock(&lock); + + ret = cma_remove_id_dev(id_priv); + cma_deref_id(id_priv); + if (ret) + rdma_destroy_id(&id_priv->id); + + mutex_lock(&lock); + } + mutex_unlock(&lock); + + atomic_dec(&cma_dev->refcount); + wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount)); +} + +static void cma_remove_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + + cma_dev = ib_get_client_data(device, &cma_client); + if (!cma_dev) + return; + + mutex_lock(&lock); + list_del(&cma_dev->list); + mutex_unlock(&lock); + + cma_process_remove(cma_dev); + kfree(cma_dev); +} + +static int cma_init(void) +{ + return ib_register_client(&cma_client); +} + +static void cma_cleanup(void) +{ + ib_unregister_client(&cma_client); +} + +module_init(cma_init); +module_exit(cma_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:16:18.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:35:48.000000000 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o + ib_cm.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o 
@@ -12,6 +12,8 @@ ib_sa-y := sa_query.o ib_cm-y := cm.o +rdma_cm-y := cma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/rdma_cm.h linux-2.6.ib/include/rdma/rdma_cm.h --- linux-2.6.git/include/rdma/rdma_cm.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/rdma_cm.h 2006-01-16 16:19:12.000000000 -0800 @@ -0,0 +1,255 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ + +#if !defined(RDMA_CM_H) +#define RDMA_CM_H + +#include +#include +#include +#include + +/* + * Upon receiving a device removal event, users must destroy the associated + * RDMA identifier and release all resources allocated with the device. 
+ */ +enum rdma_cm_event_type { + RDMA_CM_EVENT_ADDR_RESOLVED, + RDMA_CM_EVENT_ADDR_ERROR, + RDMA_CM_EVENT_ROUTE_RESOLVED, + RDMA_CM_EVENT_ROUTE_ERROR, + RDMA_CM_EVENT_CONNECT_REQUEST, + RDMA_CM_EVENT_CONNECT_RESPONSE, + RDMA_CM_EVENT_CONNECT_ERROR, + RDMA_CM_EVENT_UNREACHABLE, + RDMA_CM_EVENT_REJECTED, + RDMA_CM_EVENT_ESTABLISHED, + RDMA_CM_EVENT_DISCONNECTED, + RDMA_CM_EVENT_DEVICE_REMOVAL, +}; + +enum rdma_port_space { + RDMA_PS_SDP = 0x0001, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 +}; + +struct rdma_addr { + struct sockaddr src_addr; + u8 src_pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; + struct sockaddr dst_addr; + u8 dst_pad[sizeof(struct sockaddr_in6) - + sizeof(struct sockaddr)]; + struct rdma_dev_addr dev_addr; +}; + +struct rdma_route { + struct rdma_addr addr; + struct ib_sa_path_rec *path_rec; + int num_paths; +}; + +struct rdma_cm_event { + enum rdma_cm_event_type event; + int status; + void *private_data; + u8 private_data_len; +}; + +struct rdma_cm_id; + +/** + * rdma_cm_event_handler - Callback used to report user events. + * + * Notes: Users may not call rdma_destroy_id from this callback to destroy + * the passed in id, or a corresponding listen id. Returning a + * non-zero value from the callback will destroy the corresponding id. + */ +typedef int (*rdma_cm_event_handler)(struct rdma_cm_id *id, + struct rdma_cm_event *event); + +struct rdma_cm_id { + struct ib_device *device; + void *context; + struct ib_qp *qp; + rdma_cm_event_handler event_handler; + struct rdma_route route; + enum rdma_port_space ps; + u8 port_num; +}; + +/** + * rdma_create_id - Create an RDMA identifier. + * + * @event_handler: User callback invoked to report events associated with the + * returned rdma_id. + * @context: User specified context associated with the id. + * @ps: RDMA port space. 
+ */ +struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, + void *context, enum rdma_port_space ps); + +void rdma_destroy_id(struct rdma_cm_id *id); + +/** + * rdma_bind_addr - Bind an RDMA identifier to a source address and + * associated RDMA device, if needed. + * + * @id: RDMA identifier. + * @addr: Local address information. Wildcard values are permitted. + * + * This associates a source address with the RDMA identifier before calling + * rdma_listen. If a specific local address is given, the RDMA identifier will + * be bound to a local RDMA device. + */ +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr); + +/** + * rdma_resolve_addr - Resolve destination and optional source addresses + * from IP addresses to an RDMA address. If successful, the specified + * rdma_cm_id will be bound to a local device. + * + * @id: RDMA identifier. + * @src_addr: Source address information. This parameter may be NULL. + * @dst_addr: Destination address information. + * @timeout_ms: Time to wait for resolution to complete. + */ +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms); + +/** + * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier + * into route information needed to establish a connection. + * + * This is called on the client side of a connection. + * Users must have first called rdma_resolve_addr to resolve a dst_addr + * into an RDMA address before calling this routine. + */ +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms); + +/** + * rdma_create_qp - Allocate a QP and associate it with the specified RDMA + * identifier. + * + * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA + * through their states. + */ +int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA + * identifier. 
+ * + * Users must destroy any QP associated with an RDMA identifier before + * destroying the RDMA ID. + */ +void rdma_destroy_qp(struct rdma_cm_id *id); + +/** + * rdma_init_qp_attr - Initializes the QP attributes for use in transitioning + * to a specified QP state. + * @id: Communication identifier associated with the QP attributes to + * initialize. + * @qp_attr: On input, specifies the desired QP state. On output, the + * mandatory and desired optional attributes will be set in order to + * modify the QP to the specified state. + * @qp_attr_mask: The QP attribute mask that may be used to transition the + * QP to the specified state. + * + * Users must set the @qp_attr->qp_state to the desired QP state. This call + * will set all required attributes for the given transition, along with + * known optional attributes. Users may override the attributes returned from + * this call before calling ib_modify_qp. + * + * Users that wish to have their QP automatically transitioned through its + * states can associate a QP with the rdma_cm_id by calling rdma_create_qp(). + */ +int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask); + +struct rdma_conn_param { + const void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 flow_control; + u8 retry_count; /* ignored when accepting */ + u8 rnr_retry_count; + /* Fields below ignored if a QP is created on the rdma_cm_id. */ + u8 srq; + u32 qp_num; + enum ib_qp_type qp_type; +}; + +/** + * rdma_connect - Initiate an active connection request. + * + * Users must have resolved a route for the rdma_cm_id to connect with + * by having called rdma_resolve_route before calling this routine. + */ +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_listen - This function is called by the passive side to + * listen for incoming connection requests. 
+ * + * Users must have bound the rdma_cm_id to a local address by calling + * rdma_bind_addr before calling this routine. + */ +int rdma_listen(struct rdma_cm_id *id, int backlog); + +/** + * rdma_accept - Called to accept a connection request or response. + * @id: Connection identifier associated with the request. + * @conn_param: Information needed to establish the connection. This must be + * provided if accepting a connection request. If accepting a connection + * response, this parameter must be NULL. + * + * Typically, this routine is only called by the listener to accept a connection + * request. It must also be called on the active side of a connection if the + * user is performing their own QP transitions. + */ +int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_reject - Called on the passive side to reject a connection request. + */ +int rdma_reject(struct rdma_cm_id *id, const void *private_data, + u8 private_data_len); + +/** + * rdma_disconnect - This function disconnects the associated QP. + */ +int rdma_disconnect(struct rdma_cm_id *id); + +#endif /* RDMA_CM_H */ + From sean.hefty at intel.com Wed Feb 1 12:19:55 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 1 Feb 2006 12:19:55 -0800 Subject: [openib-general] [PATCH 5/5] Infiniband: connection abstraction In-Reply-To: Message-ID: This patch adds the kernel component to support the userspace Infiniband/RDMA connection agent library. 
Signed-off-by: Sean Hefty --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.000000000 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o rdma_cm.o + ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_cm-y := cm.o rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c --- linux-2.6.git/drivers/infiniband/core/ucma.c 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.000000000 -0800 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG = 128 +}; + +struct ucma_file { + struct mutex file_mutex; + struct file *filp; + struct list_head ctxs; + struct list_head events; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_t ref; + int events_reported; + int backlog; + + struct ucma_file *file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_head events; /* list of pending events. 
*/ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DEFINE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + mutex_lock(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + mutex_unlock(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + mutex_lock(&ctx->file->file_mutex); + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, struct ucma_event, + ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + /* clear incoming connections. 
*/ + if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) + rdma_destroy_id(uevent->cm_id); + + kfree(uevent); + } + mutex_unlock(&ctx->file->file_mutex); +} + +static struct ucma_context* ucma_alloc_ctx(struct ucma_file *file) +{ + struct ucma_context *ctx; + int ret; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + atomic_set(&ctx->ref, 1); + init_waitqueue_head(&ctx->wait); + ctx->file = file; + INIT_LIST_HEAD(&ctx->events); + + do { + ret = idr_pre_get(&ctx_idr, GFP_KERNEL); + if (!ret) + goto error; + + mutex_lock(&ctx_mutex); + ret = idr_get_new(&ctx_idr, ctx, &ctx->id); + mutex_unlock(&ctx_mutex); + } while (ret == -EAGAIN); + + if (ret) + goto error; + + list_add_tail(&ctx->file_list, &file->ctxs); + return ctx; + +error: + kfree(ctx); + return NULL; +} + +static int ucma_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + struct ucma_event *uevent; + struct ucma_context *ctx = cm_id->context; + int ret = 0; + + uevent = kzalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) + return event->event == RDMA_CM_EVENT_CONNECT_REQUEST; + + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; + uevent->resp.event = event->event; + uevent->resp.status = event->status; + if ((uevent->resp.private_data_len = event->private_data_len)) + memcpy(uevent->resp.private_data, event->private_data, + event->private_data_len); + + mutex_lock(&ctx->file->file_mutex); + if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) { + if (!ctx->backlog) { + ret = -EDQUOT; + goto out; + } + ctx->backlog--; + } + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + wake_up_interruptible(&ctx->file->poll_wait); +out: + mutex_unlock(&ctx->file->file_mutex); + return ret; +} + +static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ucma_context *ctx; + struct 
rdma_ucm_get_event cmd; + struct ucma_event *uevent; + int ret = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct rdma_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&file->file_mutex); + while (list_empty(&file->events)) { + if (file->filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + mutex_unlock(&file->file_mutex); + schedule(); + mutex_lock(&file->file_mutex); + finish_wait(&file->poll_wait, &wait); + } + + if (ret) + goto done; + + uevent = list_entry(file->events.next, struct ucma_event, file_list); + + if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) { + ctx = ucma_alloc_ctx(file); + if (!ctx) { + ret = -ENOMEM; + goto done; + } + uevent->ctx->backlog++; + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + ret = -EFAULT; + goto done; + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + uevent->ctx->events_reported++; + kfree(uevent); +done: + mutex_unlock(&file->file_mutex); + return ret; +} + +static ssize_t ucma_create_id(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_create_id cmd; + struct rdma_ucm_create_id_resp resp; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&file->file_mutex); + ctx = ucma_alloc_ctx(file); + mutex_unlock(&file->file_mutex); + if (!ctx) + return -ENOMEM; + + ctx->uid = cmd.uid; + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); + if (IS_ERR(ctx->cm_id)) { + ret = PTR_ERR(ctx->cm_id); + goto err1; + } + + resp.id = ctx->id; + if (copy_to_user((void 
__user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + ret = -EFAULT; + goto err2; + } + return 0; + +err2: + rdma_destroy_id(ctx->cm_id); +err1: + mutex_lock(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + mutex_unlock(&ctx_mutex); + kfree(ctx); + return ret; +} + +static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_destroy_id cmd; + struct rdma_ucm_destroy_id_resp resp; + struct ucma_context *ctx; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + mutex_lock(&ctx_mutex); + ctx = idr_find(&ctx_idr, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_idr, ctx->id); + mutex_unlock(&ctx_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the id. */ + rdma_destroy_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. 
*/ + ucma_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + + kfree(ctx); + return ret; +} + +static ssize_t ucma_bind_addr(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_bind_addr cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_addr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_addr cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr, + (struct sockaddr *) &cmd.dst_addr, + cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_route cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_resolve_route(ctx->cm_id, cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp, + struct rdma_route *route) +{ + struct rdma_dev_addr *dev_addr; + + resp->num_paths = route->num_paths; + switch (route->num_paths) { + case 0: + dev_addr = &route->addr.dev_addr; + memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr), + sizeof(union ib_gid)); + memcpy(&resp->ib_route[0].sgid, 
ib_addr_get_sgid(dev_addr), + sizeof(union ib_gid)); + resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr)); + break; + case 2: + ib_copy_path_rec_to_user(&resp->ib_route[1], + &route->path_rec[1]); + /* fall through */ + case 1: + ib_copy_path_rec_to_user(&resp->ib_route[0], + &route->path_rec[0]); + break; + default: + break; + } +} + +static ssize_t ucma_query_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_query_route cmd; + struct rdma_ucm_query_route_resp resp; + struct ucma_context *ctx; + struct sockaddr *addr; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (!ctx->cm_id->device) { + ret = -ENODEV; + goto out; + } + + addr = &ctx->cm_id->route.addr.src_addr; + memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ? + sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in6)); + addr = &ctx->cm_id->route.addr.dst_addr; + memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ? 
+ sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in6)); + resp.node_guid = ctx->cm_id->device->node_guid; + resp.port_num = ctx->cm_id->port_num; + switch (ctx->cm_id->device->node_type) { + case IB_NODE_CA: + ucma_copy_ib_route(&resp, &ctx->cm_id->route); + default: + break; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_conn_param(struct rdma_conn_param *dst_conn, + struct rdma_ucm_conn_param *src_conn) +{ + dst_conn->private_data = src_conn->private_data; + dst_conn->private_data_len = src_conn->private_data_len; + dst_conn->responder_resources =src_conn->responder_resources; + dst_conn->initiator_depth = src_conn->initiator_depth; + dst_conn->flow_control = src_conn->flow_control; + dst_conn->retry_count = src_conn->retry_count; + dst_conn->rnr_retry_count = src_conn->rnr_retry_count; + dst_conn->srq = src_conn->srq; + dst_conn->qp_num = src_conn->qp_num; + dst_conn->qp_type = src_conn->qp_type; +} + +static ssize_t ucma_connect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_connect cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + if (!cmd.conn_param.valid) + return -EINVAL; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_connect(ctx->cm_id, &conn_param); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_listen(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_listen cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ctx->backlog = cmd.backlog > 0 && cmd.backlog < 
UCMA_MAX_BACKLOG ? + cmd.backlog : UCMA_MAX_BACKLOG; + ret = rdma_listen(ctx->cm_id, ctx->backlog); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_accept(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_accept cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (cmd.conn_param.valid) { + ctx->uid = cmd.uid; + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_accept(ctx->cm_id, &conn_param); + } else + ret = rdma_accept(ctx->cm_id, NULL); + + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_reject(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_reject cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_reject(ctx->cm_id, cmd.private_data, cmd.private_data_len); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_disconnect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_disconnect cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_disconnect(ctx->cm_id); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_init_qp_attr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_init_qp_attr cmd; + struct ib_uverbs_qp_attr resp; + struct ucma_context *ctx; + struct ib_qp_attr qp_attr; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, 
cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.qp_attr_mask = 0; + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.qp_state = cmd.qp_state; + ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); + if (ret) + goto out; + + ib_copy_qp_attr_to_user(&resp, &qp_attr); + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id, + [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id, + [RDMA_USER_CM_CMD_BIND_ADDR] = ucma_bind_addr, + [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr, + [RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route, + [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route, + [RDMA_USER_CM_CMD_CONNECT] = ucma_connect, + [RDMA_USER_CM_CMD_LISTEN] = ucma_listen, + [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept, + [RDMA_USER_CM_CMD_REJECT] = ucma_reject, + [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, + [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event +}; + +static ssize_t ucma_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ucma_file *file = filp->private_data; + struct rdma_ucm_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + ret = ucma_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static unsigned int ucma_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ucma_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + 
mutex_lock(&file->file_mutex); + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + mutex_unlock(&file->file_mutex); + + return mask; +} + +static int ucma_open(struct inode *inode, struct file *filp) +{ + struct ucma_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + mutex_init(&file->file_mutex); + + filp->private_data = file; + file->filp = filp; + return 0; +} + +static int ucma_close(struct inode *inode, struct file *filp) +{ + struct ucma_file *file = filp->private_data; + struct ucma_context *ctx; + + mutex_lock(&file->file_mutex); + while (!list_empty(&file->ctxs)) { + ctx = list_entry(file->ctxs.next, struct ucma_context, + file_list); + mutex_unlock(&file->file_mutex); + + mutex_lock(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + mutex_unlock(&ctx_mutex); + + rdma_destroy_id(ctx->cm_id); + ucma_cleanup_events(ctx); + kfree(ctx); + + mutex_lock(&file->file_mutex); + } + mutex_unlock(&file->file_mutex); + kfree(file); + return 0; +} + +static struct file_operations ucma_fops = { + .owner = THIS_MODULE, + .open = ucma_open, + .release = ucma_close, + .write = ucma_write, + .poll = ucma_poll, +}; + +static struct miscdevice ucma_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "rdma_cm", + .fops = &ucma_fops, +}; + +static int __init ucma_init(void) +{ + return misc_register(&ucma_misc); +} + +static void __exit ucma_cleanup(void) +{ + misc_deregister(&ucma_misc); + idr_destroy(&ctx_idr); +} + +module_init(ucma_init); +module_exit(ucma_cleanup); diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/include/rdma/rdma_user_cm.h linux-2.6.ib/include/rdma/rdma_user_cm.h --- linux-2.6.git/include/rdma/rdma_user_cm.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.ib/include/rdma/rdma_user_cm.h 2006-01-16 16:54:55.000000000 -0800 @@ -0,0 +1,186 @@ +/* + * Copyright (c) 2005 Intel Corporation. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef RDMA_USER_CM_H +#define RDMA_USER_CM_H + +#include +#include +#include +#include + +#define RDMA_USER_CM_ABI_VERSION 1 + +#define RDMA_MAX_PRIVATE_DATA 256 + +enum { + RDMA_USER_CM_CMD_CREATE_ID, + RDMA_USER_CM_CMD_DESTROY_ID, + RDMA_USER_CM_CMD_BIND_ADDR, + RDMA_USER_CM_CMD_RESOLVE_ADDR, + RDMA_USER_CM_CMD_RESOLVE_ROUTE, + RDMA_USER_CM_CMD_QUERY_ROUTE, + RDMA_USER_CM_CMD_CONNECT, + RDMA_USER_CM_CMD_LISTEN, + RDMA_USER_CM_CMD_ACCEPT, + RDMA_USER_CM_CMD_REJECT, + RDMA_USER_CM_CMD_DISCONNECT, + RDMA_USER_CM_CMD_INIT_QP_ATTR, + RDMA_USER_CM_CMD_GET_EVENT +}; + +/* + * command ABI structures. 
+ */ +struct rdma_ucm_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct rdma_ucm_create_id { + __u64 uid; + __u64 response; +}; + +struct rdma_ucm_create_id_resp { + __u32 id; +}; + +struct rdma_ucm_destroy_id { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_destroy_id_resp { + __u32 events_reported; +}; + +struct rdma_ucm_bind_addr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + +struct rdma_ucm_resolve_addr { + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + __u32 id; + __u32 timeout_ms; +}; + +struct rdma_ucm_resolve_route { + __u32 id; + __u32 timeout_ms; +}; + +struct rdma_ucm_query_route { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_query_route_resp { + __u64 node_guid; + struct ib_user_path_rec ib_route[2]; + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + __u32 num_paths; + __u8 port_num; + __u8 reserved[3]; +}; + +struct rdma_ucm_conn_param { + __u32 qp_num; + __u32 qp_type; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; + __u8 private_data_len; + __u8 srq; + __u8 responder_resources; + __u8 initiator_depth; + __u8 flow_control; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 valid; +}; + +struct rdma_ucm_connect { + struct rdma_ucm_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_listen { + __u32 id; + __u32 backlog; +}; + +struct rdma_ucm_accept { + __u64 uid; + struct rdma_ucm_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct rdma_ucm_reject { + __u32 id; + __u8 private_data_len; + __u8 reserved[3]; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +struct rdma_ucm_disconnect { + __u32 id; +}; + +struct rdma_ucm_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct rdma_ucm_get_event { + __u64 response; +}; + +struct rdma_ucm_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u8 private_data_len; + __u8 reserved[3]; + __u8 
private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +#endif /* RDMA_USER_CM_H */ From halr at voltaire.com Wed Feb 1 12:18:02 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Feb 2006 15:18:02 -0500 Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B68D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B68D@mtlexch01.mtl.com> Message-ID: <1138825073.15119.4066.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-02-01 at 09:50, Eitan Zahavi wrote: > Hi Hal, > > Please see below: > > > When trying to compile the Windows stack with some late updates, > I've > > > encountered an issue with the addition/change of place of asserts to > > > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable, > > > then these asserts cause failure due to declaration in the middle of > > > the function. > > > > The asserts are on a passed in pointer rather than the static variable > > created by the MACRO based on the second parameter to OSM_LOG_ENTER. I > > don't understand how this causes a problem. Is it Windows only? > [EZ] In C you are not allowed to mix variable declarations with > statements like "if" (on the same code block). In a debug build the > CL_ASSERT includes an "if" statement that is later followed by > OSM_LOG_ENTER which declares the static variable. It used to fail the build > on Linux but for some reason it stopped. I guess if we used -pedantic or > ANSI it would fail. OK. I understand now what is going on. > Anyway, the assert is on the passed down parameter > which is passed as NULL. This might happen only on a race in the "destroy" > flow - but if this is a race it is not guaranteed to catch the bug as > the pointer might be freed after the assert. It should be caught as a > "segfault" (dereferencing NULL pointer) or in valgrind. This is not always possible. > We have a few options: > a. Do not use same code tree for WinIB - I do not think we want that.
I too would prefer not to do that, but this is not a requirement of OpenIB. > b. Put everything after the CL_ASSERT in an internal code block (i.e. > "{") - I do not think we want to do this either. I don't think that solves the problem: doesn't OSM_LOG_EXIT need access to the variable created by OSM_LOG_ENTER? > c. Move the CL_ASSERT before the function call (into the function > caller). Ugh... > d. Give up these few asserts as this can only happen as a race during > resource destruction. e. What about some conditionalization of these asserts? #ifndef __WIN__ CL_ASSERT(foo); #endif It's already in other places in OpenSM. > I think that in this case it is more important to keep the WinIB and > Linux tree identical. This is not a requirement although it is desirable. -- Hal > > > These asserts are all on the receiver object or the manager object, > so > > > I don't think they are really necessary. > > > > They compile out when not using debug. I saw these trip at SC05. > [EZ] As explained - yes they can trip - but only if we have memory > pollution (that could be caught by valgrind) or during exit - when > there really is a race that might not be caught by the assert. > > > > -- Hal > > > > > The following patch removes these asserts.
> > > > > > Thanks, > > > Yael > > > > > > Signed-off-by: Yael Kalka > > > > > > Index: opensm/osm_pkey_rcv.c > > > =================================================================== > > > --- opensm/osm_pkey_rcv.c (revision 5246) > > > +++ opensm/osm_pkey_rcv.c (working copy) > > > @@ -71,8 +71,6 @@ void > > > osm_pkey_rcv_destroy( > > > IN osm_pkey_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -125,8 +123,6 @@ osm_pkey_rcv_process( > > > uint8_t port_num; > > > uint16_t block_num; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sm_state_mgr.c > > > =================================================================== > > > --- opensm/osm_sm_state_mgr.c (revision 5246) > > > +++ opensm/osm_sm_state_mgr.c (working copy) > > > @@ -406,8 +406,6 @@ void > > > osm_sm_state_mgr_destroy( > > > IN osm_sm_state_mgr_t * const p_sm_mgr ) > > > { > > > - CL_ASSERT( p_sm_mgr ); > > > - > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); > > > > > > cl_spinlock_destroy( &p_sm_mgr->state_lock ); > > > @@ -500,8 +498,6 @@ osm_sm_state_mgr_process( > > > { > > > ib_api_status_t status = IB_SUCCESS; > > > > > > - CL_ASSERT( p_sm_mgr ); > > > - > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); > > > > > > /* > > > @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality( > > > { > > > ib_api_status_t status = IB_SUCCESS; > > > > > > - CL_ASSERT( p_sm_mgr ); > > > - > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality > ); > > > > > > /* > > > Index: opensm/osm_state_mgr.c > > > =================================================================== > > > --- opensm/osm_state_mgr.c (revision 5246) > > > +++ opensm/osm_state_mgr.c (working copy) > > > @@ -86,8 +86,6 @@ void > > > osm_state_mgr_destroy( > > > IN 
osm_state_mgr_t * const p_mgr ) > > > { > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); > > > > > > /* destroy the locks */ > > > @@ -1884,8 +1882,6 @@ osm_state_mgr_process( > > > ib_api_status_t status; > > > osm_remote_sm_t *p_remote_sm; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); > > > > > > /* if we are exiting do nothing */ > > > Index: opensm/osm_sa_guidinfo_record.c > > > =================================================================== > > > --- opensm/osm_sa_guidinfo_record.c (revision 5246) > > > +++ opensm/osm_sa_guidinfo_record.c (working copy) > > > @@ -433,8 +433,6 @@ osm_gir_rcv_process( > > > ib_api_status_t status; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sa_vlarb_record.c > > > =================================================================== > > > --- opensm/osm_sa_vlarb_record.c (revision 5246) > > > +++ opensm/osm_sa_vlarb_record.c (working copy) > > > @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process( > > > ib_net64_t comp_mask; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); > > > > > > /* update the requestor physical port. 
*/ > > > Index: opensm/osm_sa_lft_record.c > > > =================================================================== > > > --- opensm/osm_sa_lft_record.c (revision 5246) > > > +++ opensm/osm_sa_lft_record.c (working copy) > > > @@ -329,8 +329,6 @@ osm_lftr_rcv_process( > > > ib_api_status_t status; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sa_portinfo_record.c > > > =================================================================== > > > --- opensm/osm_sa_portinfo_record.c (revision 5246) > > > +++ opensm/osm_sa_portinfo_record.c (working copy) > > > @@ -600,8 +600,6 @@ osm_pir_rcv_process( > > > osm_physp_t* p_req_physp; > > > boolean_t trusted_req = TRUE; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_req.c > > > =================================================================== > > > --- opensm/osm_req.c (revision 5246) > > > +++ opensm/osm_req.c (working copy) > > > @@ -131,8 +131,6 @@ osm_req_get( > > > ib_api_status_t status = IB_SUCCESS; > > > ib_net64_t tid; > > > > > > - CL_ASSERT( p_req ); > > > - > > > OSM_LOG_ENTER( p_req->p_log, osm_req_get ); > > > > > > CL_ASSERT( p_path ); > > > @@ -222,8 +220,6 @@ osm_req_set( > > > ib_api_status_t status = IB_SUCCESS; > > > ib_net64_t tid; > > > > > > - CL_ASSERT( p_req ); > > > - > > > OSM_LOG_ENTER( p_req->p_log, osm_req_set ); > > > > > > CL_ASSERT( p_path ); > > > Index: opensm/osm_sa_pkey_record.c > > > =================================================================== > > > --- opensm/osm_sa_pkey_record.c (revision 5246) > > > +++ opensm/osm_sa_pkey_record.c (working copy) > > > @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process( > > > ib_net64_t comp_mask; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( 
p_rcv->p_log, osm_pkey_rec_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_lin_fwd_rcv.c > > > =================================================================== > > > --- opensm/osm_lin_fwd_rcv.c (revision 5246) > > > +++ opensm/osm_lin_fwd_rcv.c (working copy) > > > @@ -75,8 +75,6 @@ void > > > osm_lft_rcv_destroy( > > > IN osm_lft_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -121,8 +119,6 @@ osm_lft_rcv_process( > > > ib_net64_t node_guid; > > > ib_api_status_t status; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sa_slvl_record.c > > > =================================================================== > > > --- opensm/osm_sa_slvl_record.c (revision 5246) > > > +++ opensm/osm_sa_slvl_record.c (working copy) > > > @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process( > > > ib_net64_t comp_mask; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sminfo_rcv.c > > > =================================================================== > > > --- opensm/osm_sminfo_rcv.c (revision 5246) > > > +++ opensm/osm_sminfo_rcv.c (working copy) > > > @@ -80,8 +80,6 @@ void > > > osm_sminfo_rcv_destroy( > > > IN osm_sminfo_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > Index: opensm/osm_node_info_rcv.c > > > =================================================================== > > > --- opensm/osm_node_info_rcv.c (revision 5246) > > > +++ opensm/osm_node_info_rcv.c (working copy) > > > @@ -981,8 +981,6 @@ void > > > osm_ni_rcv_destroy( > > > IN osm_ni_rcv_t* const 
p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -1028,8 +1026,6 @@ osm_ni_rcv_process( > > > osm_node_t *p_node; > > > boolean_t process_new_flag = FALSE; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_mcast_mgr.c > > > =================================================================== > > > --- opensm/osm_mcast_mgr.c (revision 5246) > > > +++ opensm/osm_mcast_mgr.c (working copy) > > > @@ -394,8 +394,6 @@ void > > > osm_mcast_mgr_destroy( > > > IN osm_mcast_mgr_t* const p_mgr ) > > > { > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl( > > > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > > > osm_signal_t signal = OSM_SIGNAL_DONE; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); > > > > > > CL_ASSERT( p_sw ); > > > Index: opensm/osm_sa_sminfo_record.c > > > =================================================================== > > > --- opensm/osm_sa_sminfo_record.c (revision 5246) > > > +++ opensm/osm_sa_sminfo_record.c (working copy) > > > @@ -89,8 +89,6 @@ void > > > osm_smir_rcv_destroy( > > > IN osm_smir_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -142,8 +140,6 @@ osm_smir_rcv_process( > > > ib_net64_t local_guid; > > > osm_port_t* local_port; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_trap_rcv.c > > > =================================================================== > > > --- opensm/osm_trap_rcv.c (revision 5246) > > 
> +++ opensm/osm_trap_rcv.c (working copy) > > > @@ -189,8 +189,6 @@ void > > > osm_trap_rcv_destroy( > > > IN osm_trap_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); > > > > > > cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); > > > Index: opensm/osm_ucast_mgr.c > > > =================================================================== > > > --- opensm/osm_ucast_mgr.c (revision 5246) > > > +++ opensm/osm_ucast_mgr.c (working copy) > > > @@ -90,8 +90,6 @@ void > > > osm_ucast_mgr_destroy( > > > IN osm_ucast_mgr_t* const p_mgr ) > > > { > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table( > > > uint32_t block_id_ho = 0; > > > uint8_t block[IB_SMP_DATA_SIZE]; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); > > > > > > CL_ASSERT( p_sw ); > > > Index: opensm/osm_sa_node_record.c > > > =================================================================== > > > --- opensm/osm_sa_node_record.c (revision 5246) > > > +++ opensm/osm_sa_node_record.c (working copy) > > > @@ -435,8 +435,6 @@ osm_nr_rcv_process( > > > ib_api_status_t status; > > > osm_physp_t* p_req_physp; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sw_info_rcv.c > > > =================================================================== > > > --- opensm/osm_sw_info_rcv.c (revision 5246) > > > +++ opensm/osm_sw_info_rcv.c (working copy) > > > @@ -363,8 +363,6 @@ __osm_si_rcv_process_new( > > > ib_smp_t *p_smp; > > > cl_qmap_t *p_sw_guid_tbl; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); > > > > > > CL_ASSERT( p_madw ); > > > @@ -582,8 +580,6 @@ void > > > osm_si_rcv_destroy( 
> > > IN osm_si_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -631,8 +627,6 @@ osm_si_rcv_process( > > > ib_net64_t node_guid; > > > osm_si_context_t *p_context; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_mcast_fwd_rcv.c > > > =================================================================== > > > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > > > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > > > @@ -77,8 +77,6 @@ void > > > osm_mft_rcv_destroy( > > > IN osm_mft_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > > > ib_net64_t node_guid; > > > ib_api_status_t status; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_slvl_map_rcv.c > > > =================================================================== > > > --- opensm/osm_slvl_map_rcv.c (revision 5246) > > > +++ opensm/osm_slvl_map_rcv.c (working copy) > > > @@ -83,8 +83,6 @@ void > > > osm_slvl_rcv_destroy( > > > IN osm_slvl_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > > > ib_net64_t node_guid; > > > uint8_t out_port_num, in_port_num; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_node_desc_rcv.c > > > =================================================================== > > > --- opensm/osm_node_desc_rcv.c (revision 5246) > > > 
+++ opensm/osm_node_desc_rcv.c (working copy) > > > @@ -109,8 +109,6 @@ void > > > osm_nd_rcv_destroy( > > > IN osm_nd_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > > > osm_node_t *p_node; > > > ib_net64_t node_guid; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_sa_mcmember_record.c > > > =================================================================== > > > --- opensm/osm_sa_mcmember_record.c (revision 5246) > > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > > @@ -109,8 +109,6 @@ void > > > osm_mcmr_rcv_destroy( > > > IN osm_mcmr_recv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > > > > > cl_qlock_pool_destroy( &p_rcv->pool ); > > > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > > > osm_physp_t* p_req_physp; > > > boolean_t trusted_req = TRUE; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > > > > > CL_ASSERT( p_madw ); > > > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > > > ib_member_rec_t *p_recvd_mcmember_rec; > > > boolean_t valid; > > > > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > > > > > CL_ASSERT( p_madw ); > > > Index: opensm/osm_drop_mgr.c > > > =================================================================== > > > --- opensm/osm_drop_mgr.c (revision 5246) > > > +++ opensm/osm_drop_mgr.c (working copy) > > > @@ -81,8 +81,6 @@ void > > > osm_drop_mgr_destroy( > > > IN osm_drop_mgr_t* const p_mgr ) > > > { > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > @@ -597,8 +595,6 @@ 
osm_drop_mgr_process( > > > uint8_t port_num; > > > osm_physp_t *p_physp; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > Index: opensm/osm_lid_mgr.c > > > =================================================================== > > > --- opensm/osm_lid_mgr.c (revision 5246) > > > +++ opensm/osm_lid_mgr.c (working copy) > > > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > > > osm_physp_t *p_physp; > > > int lid_changed; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > > > > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > > Index: opensm/osm_pkey_mgr.c > > > =================================================================== > > > --- opensm/osm_pkey_mgr.c (revision 5246) > > > +++ opensm/osm_pkey_mgr.c (working copy) > > > @@ -73,8 +73,6 @@ void > > > osm_pkey_mgr_destroy( > > > IN osm_pkey_mgr_t * const p_mgr ) > > > { > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > @@ -238,8 +236,6 @@ osm_pkey_mgr_process( > > > osm_physp_t *p_physp; > > > osm_signal_t result = OSM_SIGNAL_DONE; > > > > > > - CL_ASSERT( p_mgr ); > > > - > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > Index: opensm/osm_vl_arb_rcv.c > > > =================================================================== > > > --- opensm/osm_vl_arb_rcv.c (revision 5246) > > > +++ opensm/osm_vl_arb_rcv.c (working copy) > > > @@ -83,8 +83,6 @@ void > > > osm_vla_rcv_destroy( > > > IN osm_vla_rcv_t* const p_rcv ) > > > { > > > - CL_ASSERT( p_rcv ); > > > - > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > @@ -136,8 +134,6 @@ osm_vla_rcv_process( > > > ib_net64_t node_guid; > > > uint8_t port_num, block_num; > > > > > 
> -    CL_ASSERT( p_rcv );
> > > -
> > >      OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process );
> > >
> > >      CL_ASSERT( p_madw );
> > >

From eitan at mellanox.co.il Wed Feb 1 12:49:10 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 1 Feb 2006 22:49:10 +0200
Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B694@mtlexch01.mtl.com>

Hi Hal,

All of these asserts are checking objects that can only be missing if we
have memory rundown or races in OpenSM destruction.
I do not think these particular asserts are going to be useful anyway.
The damage in them is bigger than their usefulness.

We could however have a LINUX_ASSERT which is active only in LINUX.
Can we have a macro that will be mapped to CL_ASSERT only if we are in Linux?

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> Sent: Wednesday, February 01, 2006 10:18 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER
>
> Hi Eitan,
>
> On Wed, 2006-02-01 at 09:50, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > Please see below:
> > > > When trying to compile the windows stack with some late updates, I've
> > > > encountered an issue with the addition/change of place of asserts to
> > > > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a variable,
> > > > then these asserts cause failure due to declaration in the middle of
> > > > the function.
> > >
> > > The asserts are on a passed in pointer rather than the static variable
> > > created by the MACRO based on the second parameter to OSM_LOG_ENTER. I
> > > don't understand how this causes a problem. Is it Windows only?
> > > [EZ] In C you are not allowed to mix variable declarations with
> > statements like "if" (in the same code block). In a debug build the
> > CL_ASSERT includes an "if" statement that is later followed by
> > OSM_LOG_ENTER, which declares the static variable. It used to fail the
> > build on Linux but for some reason it stopped. I guess if we used
> > -pedantic or ANSI it would fail.
>
> OK. I understand now what is going on.
>
> > Anyway, the assert is on the passed down parameter
> > which is passed as NULL. This might happen only on a race in the "destroy"
> > flow - but if this is a race it is not guaranteed to catch the bug as
> > the pointer might be freed after the assert. It should be caught as a
> > "segfault" (dereferencing NULL pointer) or in valgrind.
>
> This is not always possible.
>
> > We have a few options:
> > a. Do not use same code tree for WinIB - I do not think we want that.
>
> I too would prefer not to do that but this is not a requirement of
> OpenIB.
>
> > b. Put everything after the CL_ASSERT in an internal code block (i.e.
> > "{") - I do not think we want to do this either.
>
> I don't think that solves the problem, as doesn't OSM_LOG_EXIT need
> access to that variable created by OSM_LOG_ENTER?
>
> > c. Move the CL_ASSERT before the function call (into the function
> > caller).
>
> Ugh...
>
> > d. Give up these few asserts as this only can happen as a race during
> > resource destruction.
>
> e. What about some conditionalization of these asserts?
>
> #ifndef __WIN__
> CL_ASSERT(foo);
> #endif
>
> It's already in other places in OpenSM.
>
> > I think that in this case it is more important to keep the WinIB and
> > Linux tree identical.
>
> This is not a requirement although it is desirable.
>
> -- Hal
>
> > > > These asserts are all on the receiver object or the manager object, so
> > > > I don't think they are really necessary.
> > >
> > > They compile out when not using debug. I saw these trip at SC05.
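
[Editorial sketch] The ordering problem described above, and the guarded
assert suggested in option (e), can be illustrated with a small
stand-alone example. This is only a sketch under stated assumptions: the
real CL_ASSERT and OSM_LOG_ENTER definitions are not shown in this thread,
so DEMO_LOG_ENTER, DEMO_LOG_EXIT, DEMO_WIN, demo_log_mark and the demo_*
structs are hypothetical stand-ins, and the standard assert() plays the
role of CL_ASSERT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OpenSM log and receiver objects. */
struct demo_log { int depth; };
struct demo_rcv { struct demo_log *p_log; };

/* Records entry (+1) or exit (-1); a real logger would also print func. */
static void demo_log_mark(struct demo_log *p_log, const char *func, int delta)
{
    (void)func;
    p_log->depth += delta;
}

/* Like the macro described in the thread, this one DECLARES a static
 * variable, so under C89 rules no statement may precede it in the block. */
#define DEMO_LOG_ENTER(p_log, fn) \
    static const char demo_fn_name[] = #fn; \
    demo_log_mark((p_log), demo_fn_name, +1)

#define DEMO_LOG_EXIT(p_log) demo_log_mark((p_log), demo_fn_name, -1)

int demo_rcv_process(struct demo_rcv *p_rcv, int value)
{
    int result = 0;            /* ordinary locals come first */

#ifndef DEMO_WIN               /* stand-in for the __WIN__ guard in option (e) */
    /* A statement placed before the declaration hidden in DEMO_LOG_ENTER:
     * accepted by compilers that allow mixed declarations and code,
     * but not valid in strict C89. */
    assert(p_rcv != NULL);
#endif

    DEMO_LOG_ENTER(p_rcv->p_log, demo_rcv_process);
    result = value * 2;
    DEMO_LOG_EXIT(p_rcv->p_log);
    return result;
}
```

With a default GCC build this compiles as is. Building with
-std=c89 -pedantic should make GCC complain about the assert line (mixed
declarations and code), while adding -DDEMO_WIN removes the whole
statement, so the function body again starts with declarations only.
Note that the guard wraps the complete statement, including its
semicolon, as in the snippet quoted above.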
> > [EZ] As explained - yes they can trip - but only if we have memory > > pollution (that could be caught by valgrind) or during exit - when and > > they really a race and might not be caught by the assert. > > > > > > -- Hal > > > > > > > The Following patch removes these asserts. > > > > > > > > Thanks, > > > > Yael > > > > > > > > Signed-off-by: Yael Kalka > > > > > > > > Index: opensm/osm_pkey_rcv.c > > > > =================================================================== > > > > --- opensm/osm_pkey_rcv.c (revision 5246) > > > > +++ opensm/osm_pkey_rcv.c (working copy) > > > > @@ -71,8 +71,6 @@ void > > > > osm_pkey_rcv_destroy( > > > > IN osm_pkey_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -125,8 +123,6 @@ osm_pkey_rcv_process( > > > > uint8_t port_num; > > > > uint16_t block_num; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sm_state_mgr.c > > > > =================================================================== > > > > --- opensm/osm_sm_state_mgr.c (revision 5246) > > > > +++ opensm/osm_sm_state_mgr.c (working copy) > > > > @@ -406,8 +406,6 @@ void > > > > osm_sm_state_mgr_destroy( > > > > IN osm_sm_state_mgr_t * const p_sm_mgr ) > > > > { > > > > - CL_ASSERT( p_sm_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); > > > > > > > > cl_spinlock_destroy( &p_sm_mgr->state_lock ); > > > > @@ -500,8 +498,6 @@ osm_sm_state_mgr_process( > > > > { > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > - CL_ASSERT( p_sm_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); > > > > > > > > /* > > > > @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality( > > > > { > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > - 
CL_ASSERT( p_sm_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality > > ); > > > > > > > > /* > > > > Index: opensm/osm_state_mgr.c > > > > =================================================================== > > > > --- opensm/osm_state_mgr.c (revision 5246) > > > > +++ opensm/osm_state_mgr.c (working copy) > > > > @@ -86,8 +86,6 @@ void > > > > osm_state_mgr_destroy( > > > > IN osm_state_mgr_t * const p_mgr ) > > > > { > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); > > > > > > > > /* destroy the locks */ > > > > @@ -1884,8 +1882,6 @@ osm_state_mgr_process( > > > > ib_api_status_t status; > > > > osm_remote_sm_t *p_remote_sm; > > > > > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); > > > > > > > > /* if we are exiting do nothing */ > > > > Index: opensm/osm_sa_guidinfo_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_guidinfo_record.c (revision 5246) > > > > +++ opensm/osm_sa_guidinfo_record.c (working copy) > > > > @@ -433,8 +433,6 @@ osm_gir_rcv_process( > > > > ib_api_status_t status; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sa_vlarb_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_vlarb_record.c (revision 5246) > > > > +++ opensm/osm_sa_vlarb_record.c (working copy) > > > > @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process( > > > > ib_net64_t comp_mask; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); > > > > > > > > /* update the requestor physical port. 
*/ > > > > Index: opensm/osm_sa_lft_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_lft_record.c (revision 5246) > > > > +++ opensm/osm_sa_lft_record.c (working copy) > > > > @@ -329,8 +329,6 @@ osm_lftr_rcv_process( > > > > ib_api_status_t status; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sa_portinfo_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_portinfo_record.c (revision 5246) > > > > +++ opensm/osm_sa_portinfo_record.c (working copy) > > > > @@ -600,8 +600,6 @@ osm_pir_rcv_process( > > > > osm_physp_t* p_req_physp; > > > > boolean_t trusted_req = TRUE; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_req.c > > > > =================================================================== > > > > --- opensm/osm_req.c (revision 5246) > > > > +++ opensm/osm_req.c (working copy) > > > > @@ -131,8 +131,6 @@ osm_req_get( > > > > ib_api_status_t status = IB_SUCCESS; > > > > ib_net64_t tid; > > > > > > > > - CL_ASSERT( p_req ); > > > > - > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_get ); > > > > > > > > CL_ASSERT( p_path ); > > > > @@ -222,8 +220,6 @@ osm_req_set( > > > > ib_api_status_t status = IB_SUCCESS; > > > > ib_net64_t tid; > > > > > > > > - CL_ASSERT( p_req ); > > > > - > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_set ); > > > > > > > > CL_ASSERT( p_path ); > > > > Index: opensm/osm_sa_pkey_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_pkey_record.c (revision 5246) > > > > +++ opensm/osm_sa_pkey_record.c (working copy) > > > > @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process( > > > > ib_net64_t 
comp_mask; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_lin_fwd_rcv.c > > > > =================================================================== > > > > --- opensm/osm_lin_fwd_rcv.c (revision 5246) > > > > +++ opensm/osm_lin_fwd_rcv.c (working copy) > > > > @@ -75,8 +75,6 @@ void > > > > osm_lft_rcv_destroy( > > > > IN osm_lft_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -121,8 +119,6 @@ osm_lft_rcv_process( > > > > ib_net64_t node_guid; > > > > ib_api_status_t status; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sa_slvl_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_slvl_record.c (revision 5246) > > > > +++ opensm/osm_sa_slvl_record.c (working copy) > > > > @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process( > > > > ib_net64_t comp_mask; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sminfo_rcv.c > > > > =================================================================== > > > > --- opensm/osm_sminfo_rcv.c (revision 5246) > > > > +++ opensm/osm_sminfo_rcv.c (working copy) > > > > @@ -80,8 +80,6 @@ void > > > > osm_sminfo_rcv_destroy( > > > > IN osm_sminfo_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > Index: opensm/osm_node_info_rcv.c > > > > 
=================================================================== > > > > --- opensm/osm_node_info_rcv.c (revision 5246) > > > > +++ opensm/osm_node_info_rcv.c (working copy) > > > > @@ -981,8 +981,6 @@ void > > > > osm_ni_rcv_destroy( > > > > IN osm_ni_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -1028,8 +1026,6 @@ osm_ni_rcv_process( > > > > osm_node_t *p_node; > > > > boolean_t process_new_flag = FALSE; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_mcast_mgr.c > > > > =================================================================== > > > > --- opensm/osm_mcast_mgr.c (revision 5246) > > > > +++ opensm/osm_mcast_mgr.c (working copy) > > > > @@ -394,8 +394,6 @@ void > > > > osm_mcast_mgr_destroy( > > > > IN osm_mcast_mgr_t* const p_mgr ) > > > > { > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl( > > > > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > > > > osm_signal_t signal = OSM_SIGNAL_DONE; > > > > > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); > > > > > > > > CL_ASSERT( p_sw ); > > > > Index: opensm/osm_sa_sminfo_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_sminfo_record.c (revision 5246) > > > > +++ opensm/osm_sa_sminfo_record.c (working copy) > > > > @@ -89,8 +89,6 @@ void > > > > osm_smir_rcv_destroy( > > > > IN osm_smir_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -142,8 +140,6 @@ 
osm_smir_rcv_process( > > > > ib_net64_t local_guid; > > > > osm_port_t* local_port; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_trap_rcv.c > > > > =================================================================== > > > > --- opensm/osm_trap_rcv.c (revision 5246) > > > > +++ opensm/osm_trap_rcv.c (working copy) > > > > @@ -189,8 +189,6 @@ void > > > > osm_trap_rcv_destroy( > > > > IN osm_trap_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); > > > > > > > > cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); > > > > Index: opensm/osm_ucast_mgr.c > > > > =================================================================== > > > > --- opensm/osm_ucast_mgr.c (revision 5246) > > > > +++ opensm/osm_ucast_mgr.c (working copy) > > > > @@ -90,8 +90,6 @@ void > > > > osm_ucast_mgr_destroy( > > > > IN osm_ucast_mgr_t* const p_mgr ) > > > > { > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table( > > > > uint32_t block_id_ho = 0; > > > > uint8_t block[IB_SMP_DATA_SIZE]; > > > > > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); > > > > > > > > CL_ASSERT( p_sw ); > > > > Index: opensm/osm_sa_node_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_node_record.c (revision 5246) > > > > +++ opensm/osm_sa_node_record.c (working copy) > > > > @@ -435,8 +435,6 @@ osm_nr_rcv_process( > > > > ib_api_status_t status; > > > > osm_physp_t* p_req_physp; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: 
opensm/osm_sw_info_rcv.c > > > > =================================================================== > > > > --- opensm/osm_sw_info_rcv.c (revision 5246) > > > > +++ opensm/osm_sw_info_rcv.c (working copy) > > > > @@ -363,8 +363,6 @@ __osm_si_rcv_process_new( > > > > ib_smp_t *p_smp; > > > > cl_qmap_t *p_sw_guid_tbl; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); > > > > > > > > CL_ASSERT( p_madw ); > > > > @@ -582,8 +580,6 @@ void > > > > osm_si_rcv_destroy( > > > > IN osm_si_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -631,8 +627,6 @@ osm_si_rcv_process( > > > > ib_net64_t node_guid; > > > > osm_si_context_t *p_context; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_mcast_fwd_rcv.c > > > > =================================================================== > > > > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > > > > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > > > > @@ -77,8 +77,6 @@ void > > > > osm_mft_rcv_destroy( > > > > IN osm_mft_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > > > > ib_net64_t node_guid; > > > > ib_api_status_t status; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_slvl_map_rcv.c > > > > =================================================================== > > > > --- opensm/osm_slvl_map_rcv.c (revision 5246) > > > > +++ opensm/osm_slvl_map_rcv.c (working copy) > > > > @@ -83,8 +83,6 @@ void > > > > 
osm_slvl_rcv_destroy( > > > > IN osm_slvl_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > > > > ib_net64_t node_guid; > > > > uint8_t out_port_num, in_port_num; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_node_desc_rcv.c > > > > =================================================================== > > > > --- opensm/osm_node_desc_rcv.c (revision 5246) > > > > +++ opensm/osm_node_desc_rcv.c (working copy) > > > > @@ -109,8 +109,6 @@ void > > > > osm_nd_rcv_destroy( > > > > IN osm_nd_rcv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > > > > osm_node_t *p_node; > > > > ib_net64_t node_guid; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_sa_mcmember_record.c > > > > =================================================================== > > > > --- opensm/osm_sa_mcmember_record.c (revision 5246) > > > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > > > @@ -109,8 +109,6 @@ void > > > > osm_mcmr_rcv_destroy( > > > > IN osm_mcmr_recv_t* const p_rcv ) > > > > { > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > > > > > > > cl_qlock_pool_destroy( &p_rcv->pool ); > > > > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > > > > osm_physp_t* p_req_physp; > > > > boolean_t trusted_req = TRUE; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > > > > > > > 
CL_ASSERT( p_madw ); > > > > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > > > > ib_member_rec_t *p_recvd_mcmember_rec; > > > > boolean_t valid; > > > > > > > > - CL_ASSERT( p_rcv ); > > > > - > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > > > > > > > CL_ASSERT( p_madw ); > > > > Index: opensm/osm_drop_mgr.c > > > > =================================================================== > > > > --- opensm/osm_drop_mgr.c (revision 5246) > > > > +++ opensm/osm_drop_mgr.c (working copy) > > > > @@ -81,8 +81,6 @@ void > > > > osm_drop_mgr_destroy( > > > > IN osm_drop_mgr_t* const p_mgr ) > > > > { > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > @@ -597,8 +595,6 @@ osm_drop_mgr_process( > > > > uint8_t port_num; > > > > osm_physp_t *p_physp; > > > > > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > > Index: opensm/osm_lid_mgr.c > > > > =================================================================== > > > > --- opensm/osm_lid_mgr.c (revision 5246) > > > > +++ opensm/osm_lid_mgr.c (working copy) > > > > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > > > > osm_physp_t *p_physp; > > > > int lid_changed; > > > > > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > > > > > > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > > > Index: opensm/osm_pkey_mgr.c > > > > =================================================================== > > > > --- opensm/osm_pkey_mgr.c (revision 5246) > > > > +++ opensm/osm_pkey_mgr.c (working copy) > > > > @@ -73,8 +73,6 @@ void > > > > osm_pkey_mgr_destroy( > > > > IN osm_pkey_mgr_t * const p_mgr ) > > > > { > > > > - CL_ASSERT( p_mgr ); > > > > - > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > > > > > > > OSM_LOG_EXIT( 
p_mgr->p_log );
> > > > @@ -238,8 +236,6 @@ osm_pkey_mgr_process(
> > > >      osm_physp_t *p_physp;
> > > >      osm_signal_t result = OSM_SIGNAL_DONE;
> > > >
> > > > -    CL_ASSERT( p_mgr );
> > > > -
> > > >      OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process );
> > > >
> > > >      p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl;
> > > > Index: opensm/osm_vl_arb_rcv.c
> > > > ===================================================================
> > > > --- opensm/osm_vl_arb_rcv.c (revision 5246)
> > > > +++ opensm/osm_vl_arb_rcv.c (working copy)
> > > > @@ -83,8 +83,6 @@ void
> > > >  osm_vla_rcv_destroy(
> > > >      IN osm_vla_rcv_t* const p_rcv )
> > > >  {
> > > > -    CL_ASSERT( p_rcv );
> > > > -
> > > >      OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy );
> > > >
> > > >      OSM_LOG_EXIT( p_rcv->p_log );
> > > > @@ -136,8 +134,6 @@ osm_vla_rcv_process(
> > > >      ib_net64_t node_guid;
> > > >      uint8_t port_num, block_num;
> > > >
> > > > -    CL_ASSERT( p_rcv );
> > > > -
> > > >      OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process );
> > > >
> > > >      CL_ASSERT( p_madw );
> > > >
> >

From sean.hefty at intel.com Wed Feb 1 13:34:57 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 1 Feb 2006 13:34:57 -0800
Subject: [openib-general] [PATCH] iWARP Include File Changes
In-Reply-To: <1136487265.10878.17.camel@trinity.austin.ammasso.com>
Message-ID:

> enum ib_device_cap_flags {
>@@ -86,6 +87,14 @@
>     IB_DEVICE_RC_RNR_NAK_GEN = (1<<12),
>     IB_DEVICE_SRQ_RESIZE = (1<<13),
>     IB_DEVICE_N_NOTIFY_CQ = (1<<14),
>+    IB_DEVICE_IN_ORD_PLCMNT = (1<<15),
>+    IB_DEVICE_ZERO_STAG = (1<<16),
>+    IB_DEVICE_SEND_W_INV = (1<<17),
>+    IB_DEVICE_MW = (1<<18),
>+    IB_DEVICE_FMR = (1<<19),
>+    IB_DEVICE_SRQ = (1<<20),
>+    IB_DEVICE_ARP = (1<<21),
>+    IB_DEVICE_LLP = (1<<22),
> };

I have a
couple of questions below, but does anyone object to this
portion of this patch (with some possible renaming)?

Does in order placement indicate that a technique that polls memory can
be used to determine if remote data has been received?

Can you also describe how a user would use ARP and LLP flags?

- Sean

From swise at opengridcomputing.com Wed Feb 1 13:56:00 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 01 Feb 2006 15:56:00 -0600
Subject: [openib-general] [PATCH] iWARP Include File Changes
In-Reply-To:
References:
Message-ID: <1138830960.16173.8.camel@stevo-desktop>

On Wed, 2006-02-01 at 13:34 -0800, Sean Hefty wrote:
> > enum ib_device_cap_flags {
> >@@ -86,6 +87,14 @@
> >     IB_DEVICE_RC_RNR_NAK_GEN = (1<<12),
> >     IB_DEVICE_SRQ_RESIZE = (1<<13),
> >     IB_DEVICE_N_NOTIFY_CQ = (1<<14),
> >+    IB_DEVICE_IN_ORD_PLCMNT = (1<<15),
> >+    IB_DEVICE_ZERO_STAG = (1<<16),
> >+    IB_DEVICE_SEND_W_INV = (1<<17),
> >+    IB_DEVICE_MW = (1<<18),
> >+    IB_DEVICE_FMR = (1<<19),
> >+    IB_DEVICE_SRQ = (1<<20),
> >+    IB_DEVICE_ARP = (1<<21),
> >+    IB_DEVICE_LLP = (1<<22),
> > };
>
> I have a couple of questions below, but does anyone object to this
> portion of this patch (with some possible renaming)?
>
> Does in order placement indicate that a technique that polls memory can
> be used to determine if remote data has been received?

yes.

>
> Can you also describe how a user would use ARP and LLP flags?

ARP == the rnic device handles ARP entirely for iwarp connections. The
amso1100 device does this. Otherwise, it is assumed the native stack
will do ARP to resolve MAC addresses for the rnic. The IW CM will use
this to determine whether it needs to initiate ARP during
rdma_resolve_addr() processing.

LLP == the device exposes TCP connect methods. This capability isn't
supported by the IW CM patch nor the Amso1100. It is for future devices
that might expose TCP connect methods. Some protocols that sit on top of
iwarp require this.
iSER in its current form would require this, for instance.

From rdreier at cisco.com Wed Feb 1 14:07:17 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Feb 2006 14:07:17 -0800
Subject: [openib-general] [PATCH] iWARP Include File Changes
In-Reply-To: (Sean Hefty's message of "Wed, 1 Feb 2006 13:34:57 -0800")
References:
Message-ID:

    Sean> I have a couple of questions below, but does anyone object
    Sean> to this portion of this patch (with some possible renaming)?

Nope, but as your questions indicate I think we need a nice big
comment block that says what each flag really means.

 - R.

From rdreier at cisco.com Wed Feb 1 14:11:08 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Feb 2006 14:11:08 -0800
Subject: [openib-general] svn troubles?
Message-ID:

Or is it just me? I see even trivial stuff like

    svn ls https://openib.org/svn

just hanging.

 - R.

From sean.hefty at intel.com Wed Feb 1 14:13:37 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 1 Feb 2006 14:13:37 -0800
Subject: [openib-general] svn troubles?
In-Reply-To:
Message-ID:

>Or is it just me? I see even trivial stuff like
>
> svn ls https://openib.org/svn
>
>just hanging.

An svn update just worked for me.

- Sean

From sashak at voltaire.com Wed Feb 1 14:59:23 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 2 Feb 2006 00:59:23 +0200
Subject: [openib-general] Re: [PATCH] Opensm - using default dir
In-Reply-To: <20060130124122.GY31887@mellanox.co.il>
References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> <20060130124122.GY31887@mellanox.co.il>
Message-ID: <20060201225923.GA32188@sashak.voltaire.com>

On 14:41 Mon 30 Jan , Michael S. Tsirkin wrote:
> Quoting r. Yael Kalka :
> > ===================================================================
> > --- include/opensm/osm_svn_revision.h (revision 5203)
> > +++ include/opensm/osm_svn_revision.h (working copy)
> > @@ -1 +1 @@
> > -#define OSM_SVN_REVISION ""
> > +#define OSM_SVN_REVISION "5203M"
>
> This looks like a mistake.
> And, I think this shows that keeping the generated file osm_svn_revision.h
> represents a problem.

Good point. Hal, could we svn-remove this file?

Sasha.

From caitlinb at broadcom.com Wed Feb 1 15:06:02 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 1 Feb 2006 15:06:02 -0800
Subject: [openib-general] [PATCH] iWARP Include File Changes
Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D46D@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> On Wed, 2006-02-01 at 13:34 -0800, Sean Hefty wrote:
>>> enum ib_device_cap_flags {
>>> @@ -86,6 +87,14 @@
>>> 	IB_DEVICE_RC_RNR_NAK_GEN = (1<<12),
>>> 	IB_DEVICE_SRQ_RESIZE = (1<<13),
>>> 	IB_DEVICE_N_NOTIFY_CQ = (1<<14),
>>> +	IB_DEVICE_IN_ORD_PLCMNT = (1<<15),
>>> +	IB_DEVICE_ZERO_STAG = (1<<16),
>>> +	IB_DEVICE_SEND_W_INV = (1<<17),
>>> +	IB_DEVICE_MW = (1<<18),
>>> +	IB_DEVICE_FMR = (1<<19),
>>> +	IB_DEVICE_SRQ = (1<<20),
>>> +	IB_DEVICE_ARP = (1<<21),
>>> +	IB_DEVICE_LLP = (1<<22),
>>> };
>>
>> I have a couple of questions below, but does anyone object to this
>> portion of this patch (with some possible renaming)?
>>
>> Does in order placement indicate that a technique that polls memory
>> can be used to determine if remote data has been received?
>
> yes.
>

To be precise it indicates that *this* end can certify that its
placements to memory will become visible to the local processor in the
order that they were posted on the remote side. There is NOTHING that
can tell you how your peer behaves. The best you can do is trust your
peer to relay this info to you, but you can't know that they did so
correctly.

Also note that, like many device attributes, this is read-only. You
cannot *tell* your device to work this way.

>>
>> Can you also describe how a user would use ARP and LLP flags?
>
> ARP == the rnic device handles ARP entirely for iwarp
> connections. The amso1100 device does this.
> Otherwise, it is
> assumed the native stack will do ARP to resolve MAC addresses
> for the rnic. The IW CM will use this to determine whether
> it needs to initiate ARP during
> rdma_resolve_addr() processing.
>
> LLP == the device exposes TCP connect methods. This
> capability isn't supported by the IW CM patch nor the
> Amso1100. It is for future devices that might expose TCP
> connect methods. Some protocols that sit on top of iwarp
> require this. iSER in its current form would require this, for
> instance.
>
>

Devices that supported this flag would enable common
device-independent, but transport-specific, code to perform
RDMA-enabling negotiations. Without this capability, this common logic
must rely on device specific implementations. In the long run we want
to avoid duplicating logic, especially if the logic is in the driver
rather than on the device.

From halr at voltaire.com Wed Feb 1 15:09:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Feb 2006 18:09:46 -0500
Subject: [openib-general] [PATCH] madeye: Minor update for rdma_node_type change
Message-ID: <1138835384.15119.4864.camel@hal.voltaire.com>

Update madeye for changes for rdma_node_type and use utility function
to check the transport type based on the node type.
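
A minimal user-space sketch of the check this description refers to. The
names rdma_node_get_transport, RDMA_TRANSPORT_IB and RDMA_NODE_IB_SWITCH
come from the patch; the enum layouts, the remaining enumerators, the
fallback return and the helper madeye_port_range() are assumptions made
for illustration, not the definitions in ib_verbs.h or madeye.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in enums (assumed layout); the real ones live in ib_verbs.h. */
enum rdma_node_type {
	RDMA_NODE_IB_CA = 1,
	RDMA_NODE_IB_SWITCH,
	RDMA_NODE_IB_ROUTER,
	RDMA_NODE_RNIC
};

enum rdma_transport_type {
	RDMA_TRANSPORT_IB,
	RDMA_TRANSPORT_IWARP
};

/* Map a node type to the transport it runs over. */
static enum rdma_transport_type
rdma_node_get_transport(enum rdma_node_type node_type)
{
	switch (node_type) {
	case RDMA_NODE_IB_CA:
	case RDMA_NODE_IB_SWITCH:
	case RDMA_NODE_IB_ROUTER:
		return RDMA_TRANSPORT_IB;
	case RDMA_NODE_RNIC:
		return RDMA_TRANSPORT_IWARP;
	}
	/* Assumed fallback so every path returns a value. */
	return RDMA_TRANSPORT_IB;
}

/*
 * Hypothetical helper mirroring the start of madeye_add_one(): skip
 * non-IB devices, use port 0 for a switch, and (assumption, the else
 * branch is not shown in the patch) ports 1..phys_port_cnt otherwise.
 * Returns false when the device should be ignored.
 */
static bool madeye_port_range(enum rdma_node_type node_type,
			      unsigned char phys_port_cnt,
			      unsigned char *s, unsigned char *e)
{
	assert(s && e);

	if (rdma_node_get_transport(node_type) != RDMA_TRANSPORT_IB)
		return false;

	if (node_type == RDMA_NODE_IB_SWITCH) {
		*s = 0;
		*e = 0;
	} else {
		*s = 1;
		*e = phys_port_cnt;
	}
	return true;
}
```

The point of routing the decision through the transport helper is that an
iWARP device is rejected before any IB-specific port handling runs.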

Signed-off-by: Hal Rosenstock

Index: linux-kernel/infiniband/util/madeye/madeye.c
===================================================================
--- linux-kernel/infiniband/util/madeye/madeye.c (revision 5258)
+++ linux-kernel/infiniband/util/madeye/madeye.c (working copy)
@@ -517,7 +517,10 @@ static void madeye_add_one(struct ib_dev
 	int reg_flags;
 	u8 i, s, e;
 
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		s = 0;
 		e = 0;
 	} else {
@@ -559,7 +562,7 @@ static void madeye_remove_one(struct ib_
 	if (!port)
 		return;
 
-	if (device->node_type == IB_NODE_SWITCH) {
+	if (device->node_type == RDMA_NODE_IB_SWITCH) {
 		s = 0;
 		e = 0;
 	} else {

From halr at voltaire.com Wed Feb 1 15:13:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Feb 2006 18:13:46 -0500
Subject: [openib-general] Re: [PATCH] Opensm - using default dir
In-Reply-To: <20060201225923.GA32188@sashak.voltaire.com>
References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> <20060130124122.GY31887@mellanox.co.il> <20060201225923.GA32188@sashak.voltaire.com>
Message-ID: <1138835625.15119.4890.camel@hal.voltaire.com>

On Wed, 2006-02-01 at 17:59, Sasha Khapyorsky wrote:
> On 14:41 Mon 30 Jan , Michael S. Tsirkin wrote:
> > Quoting r. Yael Kalka :
> > > ===================================================================
> > > --- include/opensm/osm_svn_revision.h (revision 5203)
> > > +++ include/opensm/osm_svn_revision.h (working copy)
> > > @@ -1 +1 @@
> > > -#define OSM_SVN_REVISION ""
> > > +#define OSM_SVN_REVISION "5203M"
> >
> > This looks like a mistake.
> > And, I think this shows that keeping the generated file osm_svn_revision.h
> > represents a problem.
>
> Good point. Hal, could we svn-remove this file?

Yes, this is possible but there is a little more work involved here as
the OSM_SVN_REVISION is checked for length 0 to determine whether to
print out the svn version message right now.

-- Hal

>
> Sasha.

From halr at voltaire.com Wed Feb 1 15:17:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Feb 2006 18:17:20 -0500
Subject: [openib-general] [PATCH] [TRIVIAL] ib_verbs.h: Quiet compiler warning from rdma_node_get_transport
Message-ID: <1138835839.15119.4911.camel@hal.voltaire.com>

ib_verbs.h: Quiet compiler warning from rdma_node_get_transport

gcc version 4.0.0 20050519 indicates the following warning with
rdma_node_get_transport:

warning: control may reach end of non-void function
‘rdma_node_get_transport’ being inlined

Signed-off-by: Hal Rosenstock
Message-ID:

>Update madeye for changes for rdma_node_type and use utility function
>to
>check the transport type based on the node type.

I had updates to the util directory that I forgot to commit. I just
committed them.

- Sean

From sean.hefty at intel.com Wed Feb 1 15:32:33 2006
From: sean.hefty at intel.com (Hefty, Sean)
Date: Wed, 1 Feb 2006 15:32:33 -0800
Subject: [openib-general] [PATCH] [TRIVIAL] ib_verbs.h: Quiet compilerwarning from rdma_node_get_transport
Message-ID:

>ib_verbs.h: Quiet compiler warning from rdma_node_get_transport
>
>gcc version 4.0.0 20050519 indicates the following warning with
>rdma_node_get_transport:
>
>warning: control may reach end of non-void function
>'rdma_node_get_transport' being inlined

Thanks - committed.

From rdreier at cisco.com Wed Feb 1 17:28:10 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Feb 2006 17:28:10 -0800
Subject: [openib-general] [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To: (Roland Dreier's message of "Mon, 23 Jan 2006 13:27:32 -0800")
References:
Message-ID:

Sorry to distract everyone from the VJ channel discussion, but on the
other hand it looks like Dave is back...
I'm resending this because I'd really like to get this problem fixed
but I want to make sure we're doing it the right way. So either an ACK
or a NAK with guidance would be great...

This is a resend of a patch written by Michael S. Tsirkin. I'd like to
get an ACK or NAK of it from Dave and other networking people, so that
we can either merge it upstream or try a different approach. There
definitely is a problem with neighbour destructors that IP-over-IB is
running into. It would be good to know what the design was behind
putting the destructor method in neigh->ops in the first place.

Dave, if you want to merge this directly, that's fine. Or I'm fine
with merging this through the IB tree if you'd prefer (if you want me
to do that, let me know if you think it's 2.6.16 material).

Thanks,
  Roland

struct neigh_ops currently has a destructor field, which no in-kernel
drivers outside of infiniband use. The infiniband/ulp/ipoib in-tree
driver stashes some info in the neighbour structure (the results of
the second-stage lookup from ARP results to real link-level path), and
it uses neigh->ops->destructor to get a callback so it can clean up
this extra info when a neighbour is freed. We've run into problems
with this: since the destructor is in an ops field that is shared
between neighbours that may belong to different net devices, there's
no way to set/clear it safely.

The following patch moves this field to neigh_parms where it can be
safely set, together with its twin neigh_setup. Two additional
patches in the patch series update ipoib to use this new interface.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Roland Dreier

---

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 6fa9ae1..b0666d6 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -68,6 +68,7 @@ struct neigh_parms
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int (*neigh_setup)(struct neighbour *);
+	void (*neigh_destructor)(struct neighbour *);
 	struct neigh_table *tbl;
 
 	void *sysctl_table;
@@ -145,7 +146,6 @@ struct neighbour
 struct neigh_ops
 {
 	int family;
-	void (*destructor)(struct neighbour *);
 	void (*solicit)(struct neighbour *, struct sk_buff*);
 	void (*error_report)(struct neighbour *, struct sk_buff*);
 	int (*output)(struct sk_buff*);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index e68700f..3489e23 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -586,8 +586,8 @@ void neigh_destroy(struct neighbour *nei
 		kfree(hh);
 	}
 
-	if (neigh->ops && neigh->ops->destructor)
-		(neigh->ops->destructor)(neigh);
+	if (neigh->parms->neigh_destructor)
+		(neigh->parms->neigh_destructor)(neigh);
 
 	skb_queue_purge(&neigh->arp_queue);
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index fd3f5c8..9588124 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -247,7 +247,6 @@ static void path_free(struct net_device
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
 		*to_ipoib_neigh(neigh->neighbour) = NULL;
-		neigh->neighbour->ops->destructor = NULL;
 		kfree(neigh);
 	}
 
@@ -530,7 +529,6 @@ static void neigh_add_path(struct sk_buf
 err:
 	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
 	list_del(&neigh->list);
-	neigh->neighbour->ops->destructor = NULL;
 	kfree(neigh);
 
 	++priv->stats.tx_dropped;
@@ -769,21 +767,9 @@ static void ipoib_neigh_destructor(struc
 	ipoib_put_ah(ah);
 }
 
-static int ipoib_neigh_setup(struct neighbour *neigh)
-{
-	/*
-	 * Is this kosher? I can't find anybody in the kernel that
-	 * sets neigh->destructor, so we should be able to set it here
-	 * without trouble.
-	 */
-	neigh->ops->destructor = ipoib_neigh_destructor;
-
-	return 0;
-}
-
 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
-	parms->neigh_setup = ipoib_neigh_setup;
+	parms->neigh_destructor = ipoib_neigh_destructor;
 
 	return 0;
 }

From davem at davemloft.net Wed Feb 1 17:31:37 2006
From: davem at davemloft.net (David S. Miller)
Date: Wed, 01 Feb 2006 17:31:37 -0800 (PST)
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To:
References:
Message-ID: <20060201.173137.10844554.davem@davemloft.net>

From: Roland Dreier
Date: Wed, 01 Feb 2006 17:28:10 -0800

> Sorry to distract everyone from the VJ channel discussion, but on the
> other hand it looks like Dave is back... I'm resending this because
> I'd really like to get this problem fixed but I want to make sure
> we're doing it the right way. So either an ACK or a NAK with guidance
> would be great...

It's sitting in my backlog, and will be a net-2.6.17 issue
not a net-2.6.16 one as we're in bug fix mode there.

Sorry if you need this in 2.6.16, but that's not really
practical.

From yoshfuji at linux-ipv6.org Wed Feb 1 17:34:48 2006
From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / 吉藤英明)
Date: Thu, 02 Feb 2006 10:34:48 +0900 (JST)
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To:
References:
Message-ID: <20060202.103448.118065476.yoshfuji@linux-ipv6.org>

In article (at Wed, 01 Feb 2006 17:28:10 -0800), Roland Dreier says:

> Sorry to distract everyone from the VJ channel discussion, but on the
> other hand it looks like Dave is back... I'm resending this because
> I'd really like to get this problem fixed but I want to make sure
> we're doing it the right way.
> So either an ACK or a NAK with guidance
> would be great...

Sorry for silence.

Since we have "setup," it'd be natural to have destruct;
It seems sane to me.

--yoshfuji

From rdreier at cisco.com Wed Feb 1 17:33:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Feb 2006 17:33:39 -0800
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To: <20060201.173137.10844554.davem@davemloft.net> (David S. Miller's message of "Wed, 01 Feb 2006 17:31:37 -0800 (PST)")
References: <20060201.173137.10844554.davem@davemloft.net>
Message-ID:

    David> It's sitting in my backlog, and will be a net-2.6.17 issue
    David> not a net-2.6.16 one as we're in bug fix mode there.

    David> Sorry if you need this in 2.6.16, but that's not really
    David> practical.

No, that's fine... I was just resending in case you were using RED to
manage your backlog. This is a real issue but I don't think it's
hitting a lot of people.

 - R.

From rdreier at cisco.com Wed Feb 1 20:14:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Feb 2006 20:14:16 -0800
Subject: [openib-general] svn troubles?
In-Reply-To: (Roland Dreier's message of "Wed, 01 Feb 2006 14:11:08 -0800")
References:
Message-ID:

Seems OK for now... there was a very slow period for a while though.

From ogerlitz at voltaire.com Wed Feb 1 22:13:26 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 02 Feb 2006 08:13:26 +0200
Subject: [openib-general] on calling rdma_disconnect from non sleepablecontext
In-Reply-To:
References:
Message-ID: <43E1A306.10002@voltaire.com>

Roland Dreier wrote:
> Yes, the modify QP operation might sleep. For example, on Mellanox
> hardware, modifying a QP requires a firmware command, which allocates
> a mailbox with GFP_KERNEL and also sleeps until the command completes.

Can you elaborate a little more on the "---might-- sleep" with regard to
the Mellanox hardware/firmware?
Empirically I saw (and could not understand) that in 99% of the cases
where my code called ib_modify_qp (via rdma_disconnect) and
ib_destroy_qp (directly) from non sleepable context (tasklet) it just
worked fine.

AFAIK since the mthca driver works in "events" command mode, it would
always sleep after issuing a command to the FW till the command
completion is reported to the commands EQ and then there's a wakeup.

> I tried to document these sorts of rules in Documentation/infiniband/core_locking.txt

Sure, the documentation does not state that modify qp is among those
verbs that may be called from non sleepable context, which is enough,
and the CMA does modify QP before it sends the DREQ, so rdma_disconnect
is also not allowed to be called from such context.

Or.

From ogerlitz at voltaire.com Wed Feb 1 22:24:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 02 Feb 2006 08:24:22 +0200
Subject: [openib-general] [PATCH 4/4] SA path record caching
In-Reply-To:
References:
Message-ID: <43E1A596.3050707@voltaire.com>

Sean Hefty wrote:
Or >> How about either having a mod param to the local SA practically telling
Or >> it to do nothing (ugly) or the local SA doing symbol_put to the cache
Or >> query function and the CMA attempting to symbol_get it, if the symbol
Or >> exists, then the CMA is first trying the cache before querying the SA,
Or >> or whatever other better idea someone can come up with?

> This sounds possible. Note that the local SA module will be there to
> support multicast join/leave operations.

I think we really need to have in the very short term some solution at
hand which effectively allows for both the cma being used and the local
SA not replicating the SA. This is since the trunk is used in various
environments, currently most of them are not "all-to-all MPI doing path
query" and the SA replica is not mature/tested enough to prove that any
SM/SA can live more or less happily with it.

Does it make sense to you? Can it be added with high priority?

Or.

From ogerlitz at voltaire.com Wed Feb 1 22:42:10 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 02 Feb 2006 08:42:10 +0200
Subject: [openib-general] [PATCH 0/4] SA path record caching
In-Reply-To:
References:
Message-ID: <43E1A9C2.5060609@voltaire.com>

Sean Hefty wrote:
Or >> Aren't we creating a monster here??? if this is SA replica which should
Or >> work for scale from day one, lets call it this way and see how to reach there

> The cache update window is configurable. ...
> Based on information from SilverStorm, the cache should work well in practice.

I recall that SilverStorm mentioned they had a well working SA replica,
but when I look at the naked math for 1k nodes and the
hard-to-reach-in-real-life uniform distribution of queries over time, I
really can't see how to reach it unless you practically never update
the cache and you have magically caused your SM/SA to survive the one
and only session of those 1k get table queries (again 350 mad/sec, so
many concurrent rmpp sessions etc etc).

>> + neither MVAPICH nor OpenMPI are using path query

> The national labs want all path records for their routing algorithms. I

So your understanding is that they have an --all to all- 1k ranks IB app
that needs to know for each node all the paths to other nodes? If the
case is a 1k IB app where each rank connects to (say log2(1k)=) 10
ranks then each rank needs all the paths to those 10 other nodes.

> I believe that the problems here were API issues that make connecting
> difficult. As a result, most applications just hard-coded everything.

I guess feedback from MPI people telling whether they have plans to use
path query would help us to see where we actually stand.

Or.

From mst at mellanox.co.il Wed Feb 1 22:57:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Feb 2006 08:57:36 +0200
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To:
References:
Message-ID: <20060202065736.GC4216@mellanox.co.il>

Quoting Roland Dreier:
> David> Sorry if you need this in 2.6.16, but that's not really
> David> practical.
>
> No, that's fine... I was just resending in case you were using RED to
> manage your backlog. This is a real issue but I don't think it's
> hitting a lot of people.

So, how about we implement the all-neigh list work-around on trunk
for now, probably inside #if version < 2.6.17?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From halr at voltaire.com Thu Feb 2 03:51:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Feb 2006 06:51:30 -0500
Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B694@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B694@mtlexch01.mtl.com>
Message-ID: <1138881085.15119.8849.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-02-01 at 15:49, Eitan Zahavi wrote:
> Hi Hal,
>
> All of these asserts are checking objects that can only be missing if we
> have memory rundown or races in OpenSM destruction. I do not think these
> particular asserts are going to be useful anyway.
> The damage in them is bigger than their usefulness.

What damage ?

> We could however have a LINUX_ASSERT which is active only in LINUX.
> Can we have a macro that will be mapped to CL_ASSERT only if we are in
> Linux?

That just hides it down one level so is essentially the same solution in
my mind as the one I proposed to conditionalize the CL_ASSERTs which are
problematic for Windows compilers.

It appears to me that there is a gcc predefine of either unix or linux
that could be used for this.

Also, shouldn't we take Michael's approach anyhow (using one of those
gcc predefines) ?

Does VC6 need to be supported on the Windows side ?
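
A small sketch of the conditionalized assert being discussed, under
stated assumptions. Only __WIN__ is taken from the thread; every
sketch_* and SKETCH_* name is a hypothetical stand-in, and the macros
below only imitate the one property that matters here (the "enter" macro
declares a variable), not OpenSM's real OSM_LOG_ENTER.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the log and receiver objects. */
struct sketch_log {
	int enter_count;
	int exit_count;
	const char *last_func;
};

struct sketch_rcv {
	struct sketch_log *p_log;
};

/*
 * Like the OSM_LOG_ENTER described in the thread, this macro declares a
 * variable. A compiler that enforces C89 rules rejects a declaration
 * that follows a statement in the same block, so an assert placed
 * before it breaks the build there.
 */
#define SKETCH_LOG_ENTER(p_log, name) \
	static const char sketch_func[] = #name; \
	((p_log)->last_func = sketch_func, (p_log)->enter_count++)

#define SKETCH_LOG_EXIT(p_log) ((p_log)->exit_count++)

void sketch_rcv_destroy(struct sketch_rcv *p_rcv)
{
#ifndef __WIN__
	/* Compiled only where declaration-after-statement is accepted. */
	assert(p_rcv != NULL);
#endif
	SKETCH_LOG_ENTER(p_rcv->p_log, sketch_rcv_destroy);

	SKETCH_LOG_EXIT(p_rcv->p_log);
}
```

With gcc the ordering can be checked explicitly by building with
-Wdeclaration-after-statement, which warns on the assert-then-declare
pattern when __WIN__ is not defined.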
-- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Hal Rosenstock > > Sent: Wednesday, February 01, 2006 10:18 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [openib-general] RE: [PATCH] Opensm - asserts before > OSM_LOG_ENTER > > > > Hi Eitan, > > > > On Wed, 2006-02-01 at 09:50, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > Please see below: > > > > > When trying to compile the windows stack with some late updates, > > > I've > > > > > encountered an issue with the addition/change of place of > asserts to > > > > > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a > variable, > > > > > then these asserts cause failure due to declaration in the > middle of > > > > > the function. > > > > > > > > The asserts are on a passed in pointer rather than the static > variable > > > > created by the MACRO based on the second parameter to > OSM_LOG_ENTER. I > > > > don't understand how this causes a problem. Is it Windows only? > > > > > [EZ] In C you are not allowed to mix variable declarations with > > > statements like "if" (on the same code block). In debug build the > > > CL_ASSERT includes an "if" statement that is later followed by > > > OSM_LOG_ENTER which declares the static variable. It used to fail > build > > > in Linux but for some reason it stopped. I guess if we used > -pedantic or > > > ANSI it would fail. > > > > OK. I understand now what is going on. > > > > > Anyway, the assert is on the passed down parameter > > > which is passed as NULL. This might happen only on race in the > "destroy" > > > flow - but if this is a race it is not guaranteed to catch the bug > as > > > the pointer might be free'ed after the assert. 
It should be caught > as a > > > "segfault" (dereferencing NULL pointer) or in valgrind. > > > > This is not always possible. > > > > > We have few options: > > > a. Do not use same code tree for WinIB - I do not think we want > that. > > > > I too would prefer not to do that but this is not a requirement of > > OpenIB. > > > > > b. Put everything after the CL_ASSERT in an internal code block > (i.e. > > > "{") - I do not think we want to do this either. > > > > I don't think that solves the problem as doesn't OSM_LOG_EXIT need > > access to that variable created by OSM_LOG_ENTER. > > > > > c. Move the CL_ASSERT before the function call (into the function > > > caller). > > > > Ugh... > > > > > d. Give up these few asserts as this only can happen as a race > during > > > resource destruction. > > > > e. What about some conditionalization of these asserts ? > > > > #ifndef __WIN__ > > CL_ASSERT(foo); > > #endif > > > > It's already in other places in OpenSM. > > > > > I think that in this case it is more important to keep the WinIB and > > > Linux tree identical. > > > > This is not a requirement although it is desirable. > > > > -- Hal > > > > > > > These asserts are all on the reciever object or the manager > object, > > > so > > > > > I don't think they are really necessary. > > > > > > > > They compile out when not using debug. I saw these trip at SC05. > > > [EZ] As explained - yes they can trip - but only if we have memory > > > pollution (that could be caught by valgrind) or during exit - when > and > > > they really a race and might not be caught by the assert. > > > > > > > > -- Hal > > > > > > > > > The Following patch removes these asserts. 
> > > > > > > > > > Thanks, > > > > > Yael > > > > > > > > > > Signed-off-by: Yael Kalka > > > > > > > > > > Index: opensm/osm_pkey_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_pkey_rcv.c (revision 5246) > > > > > +++ opensm/osm_pkey_rcv.c (working copy) > > > > > @@ -71,8 +71,6 @@ void > > > > > osm_pkey_rcv_destroy( > > > > > IN osm_pkey_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -125,8 +123,6 @@ osm_pkey_rcv_process( > > > > > uint8_t port_num; > > > > > uint16_t block_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sm_state_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_sm_state_mgr.c (revision 5246) > > > > > +++ opensm/osm_sm_state_mgr.c (working copy) > > > > > @@ -406,8 +406,6 @@ void > > > > > osm_sm_state_mgr_destroy( > > > > > IN osm_sm_state_mgr_t * const p_sm_mgr ) > > > > > { > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); > > > > > > > > > > cl_spinlock_destroy( &p_sm_mgr->state_lock ); > > > > > @@ -500,8 +498,6 @@ osm_sm_state_mgr_process( > > > > > { > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); > > > > > > > > > > /* > > > > > @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality( > > > > > { > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, > osm_sm_state_mgr_check_legality > > > ); > > > > > > > > > > /* > > > > > Index: 
opensm/osm_state_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_state_mgr.c (revision 5246) > > > > > +++ opensm/osm_state_mgr.c (working copy) > > > > > @@ -86,8 +86,6 @@ void > > > > > osm_state_mgr_destroy( > > > > > IN osm_state_mgr_t * const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); > > > > > > > > > > /* destroy the locks */ > > > > > @@ -1884,8 +1882,6 @@ osm_state_mgr_process( > > > > > ib_api_status_t status; > > > > > osm_remote_sm_t *p_remote_sm; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); > > > > > > > > > > /* if we are exiting do nothing */ > > > > > Index: opensm/osm_sa_guidinfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_guidinfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_guidinfo_record.c (working copy) > > > > > @@ -433,8 +433,6 @@ osm_gir_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_vlarb_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_vlarb_record.c (revision 5246) > > > > > +++ opensm/osm_sa_vlarb_record.c (working copy) > > > > > @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); > > > > > > > > > > /* update the requestor physical port. 
*/ > > > > > Index: opensm/osm_sa_lft_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_lft_record.c (revision 5246) > > > > > +++ opensm/osm_sa_lft_record.c (working copy) > > > > > @@ -329,8 +329,6 @@ osm_lftr_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_portinfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_portinfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_portinfo_record.c (working copy) > > > > > @@ -600,8 +600,6 @@ osm_pir_rcv_process( > > > > > osm_physp_t* p_req_physp; > > > > > boolean_t trusted_req = TRUE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_req.c > > > > > > =================================================================== > > > > > --- opensm/osm_req.c (revision 5246) > > > > > +++ opensm/osm_req.c (working copy) > > > > > @@ -131,8 +131,6 @@ osm_req_get( > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > ib_net64_t tid; > > > > > > > > > > - CL_ASSERT( p_req ); > > > > > - > > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_get ); > > > > > > > > > > CL_ASSERT( p_path ); > > > > > @@ -222,8 +220,6 @@ osm_req_set( > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > ib_net64_t tid; > > > > > > > > > > - CL_ASSERT( p_req ); > > > > > - > > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_set ); > > > > > > > > > > CL_ASSERT( p_path ); > > > > > Index: opensm/osm_sa_pkey_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_pkey_record.c (revision 5246) > > > > > +++ 
opensm/osm_sa_pkey_record.c (working copy) > > > > > @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_lin_fwd_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_lin_fwd_rcv.c (revision 5246) > > > > > +++ opensm/osm_lin_fwd_rcv.c (working copy) > > > > > @@ -75,8 +75,6 @@ void > > > > > osm_lft_rcv_destroy( > > > > > IN osm_lft_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -121,8 +119,6 @@ osm_lft_rcv_process( > > > > > ib_net64_t node_guid; > > > > > ib_api_status_t status; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_slvl_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_slvl_record.c (revision 5246) > > > > > +++ opensm/osm_sa_slvl_record.c (working copy) > > > > > @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sminfo_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_sminfo_rcv.c (revision 5246) > > > > > +++ opensm/osm_sminfo_rcv.c (working copy) > > > > > @@ -80,8 +80,6 @@ void > > > > > osm_sminfo_rcv_destroy( > > > > > IN osm_sminfo_rcv_t* const p_rcv ) > > > > > { > > > > > - 
CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > Index: opensm/osm_node_info_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_node_info_rcv.c (revision 5246) > > > > > +++ opensm/osm_node_info_rcv.c (working copy) > > > > > @@ -981,8 +981,6 @@ void > > > > > osm_ni_rcv_destroy( > > > > > IN osm_ni_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -1028,8 +1026,6 @@ osm_ni_rcv_process( > > > > > osm_node_t *p_node; > > > > > boolean_t process_new_flag = FALSE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_mcast_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_mcast_mgr.c (revision 5246) > > > > > +++ opensm/osm_mcast_mgr.c (working copy) > > > > > @@ -394,8 +394,6 @@ void > > > > > osm_mcast_mgr_destroy( > > > > > IN osm_mcast_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl( > > > > > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > > > > > osm_signal_t signal = OSM_SIGNAL_DONE; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); > > > > > > > > > > CL_ASSERT( p_sw ); > > > > > Index: opensm/osm_sa_sminfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_sminfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_sminfo_record.c (working 
copy) > > > > > @@ -89,8 +89,6 @@ void > > > > > osm_smir_rcv_destroy( > > > > > IN osm_smir_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -142,8 +140,6 @@ osm_smir_rcv_process( > > > > > ib_net64_t local_guid; > > > > > osm_port_t* local_port; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_trap_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_trap_rcv.c (revision 5246) > > > > > +++ opensm/osm_trap_rcv.c (working copy) > > > > > @@ -189,8 +189,6 @@ void > > > > > osm_trap_rcv_destroy( > > > > > IN osm_trap_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); > > > > > > > > > > cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); > > > > > Index: opensm/osm_ucast_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_ucast_mgr.c (revision 5246) > > > > > +++ opensm/osm_ucast_mgr.c (working copy) > > > > > @@ -90,8 +90,6 @@ void > > > > > osm_ucast_mgr_destroy( > > > > > IN osm_ucast_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table( > > > > > uint32_t block_id_ho = 0; > > > > > uint8_t block[IB_SMP_DATA_SIZE]; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); > > > > > > > > > > CL_ASSERT( p_sw ); > > > > > Index: opensm/osm_sa_node_record.c > > > > > > 
=================================================================== > > > > > --- opensm/osm_sa_node_record.c (revision 5246) > > > > > +++ opensm/osm_sa_node_record.c (working copy) > > > > > @@ -435,8 +435,6 @@ osm_nr_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sw_info_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_sw_info_rcv.c (revision 5246) > > > > > +++ opensm/osm_sw_info_rcv.c (working copy) > > > > > @@ -363,8 +363,6 @@ __osm_si_rcv_process_new( > > > > > ib_smp_t *p_smp; > > > > > cl_qmap_t *p_sw_guid_tbl; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > @@ -582,8 +580,6 @@ void > > > > > osm_si_rcv_destroy( > > > > > IN osm_si_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -631,8 +627,6 @@ osm_si_rcv_process( > > > > > ib_net64_t node_guid; > > > > > osm_si_context_t *p_context; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_mcast_fwd_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > > > > > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > > > > > @@ -77,8 +77,6 @@ void > > > > > osm_mft_rcv_destroy( > > > > > IN osm_mft_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > > > > > > > > 
> OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > > > > > ib_net64_t node_guid; > > > > > ib_api_status_t status; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_slvl_map_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_slvl_map_rcv.c (revision 5246) > > > > > +++ opensm/osm_slvl_map_rcv.c (working copy) > > > > > @@ -83,8 +83,6 @@ void > > > > > osm_slvl_rcv_destroy( > > > > > IN osm_slvl_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > > > > > ib_net64_t node_guid; > > > > > uint8_t out_port_num, in_port_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_node_desc_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_node_desc_rcv.c (revision 5246) > > > > > +++ opensm/osm_node_desc_rcv.c (working copy) > > > > > @@ -109,8 +109,6 @@ void > > > > > osm_nd_rcv_destroy( > > > > > IN osm_nd_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > > > > > osm_node_t *p_node; > > > > > ib_net64_t node_guid; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_mcmember_record.c > > > > > > 
=================================================================== > > > > > --- opensm/osm_sa_mcmember_record.c (revision 5246) > > > > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > > > > @@ -109,8 +109,6 @@ void > > > > > osm_mcmr_rcv_destroy( > > > > > IN osm_mcmr_recv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > > > > > > > > > cl_qlock_pool_destroy( &p_rcv->pool ); > > > > > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > > > > > osm_physp_t* p_req_physp; > > > > > boolean_t trusted_req = TRUE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > > > > > ib_member_rec_t *p_recvd_mcmember_rec; > > > > > boolean_t valid; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_drop_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_drop_mgr.c (revision 5246) > > > > > +++ opensm/osm_drop_mgr.c (working copy) > > > > > @@ -81,8 +81,6 @@ void > > > > > osm_drop_mgr_destroy( > > > > > IN osm_drop_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -597,8 +595,6 @@ osm_drop_mgr_process( > > > > > uint8_t port_num; > > > > > osm_physp_t *p_physp; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > > > > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > > > Index: opensm/osm_lid_mgr.c > > > > > > =================================================================== > > > > > 
--- opensm/osm_lid_mgr.c (revision 5246) > > > > > +++ opensm/osm_lid_mgr.c (working copy) > > > > > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > > > > > osm_physp_t *p_physp; > > > > > int lid_changed; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > > > > > > > > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > > > > Index: opensm/osm_pkey_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_pkey_mgr.c (revision 5246) > > > > > +++ opensm/osm_pkey_mgr.c (working copy) > > > > > @@ -73,8 +73,6 @@ void > > > > > osm_pkey_mgr_destroy( > > > > > IN osm_pkey_mgr_t * const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -238,8 +236,6 @@ osm_pkey_mgr_process( > > > > > osm_physp_t *p_physp; > > > > > osm_signal_t result = OSM_SIGNAL_DONE; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); > > > > > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > > > Index: opensm/osm_vl_arb_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_vl_arb_rcv.c (revision 5246) > > > > > +++ opensm/osm_vl_arb_rcv.c (working copy) > > > > > @@ -83,8 +83,6 @@ void > > > > > osm_vla_rcv_destroy( > > > > > IN osm_vla_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -136,8 +134,6 @@ osm_vla_rcv_process( > > > > > ib_net64_t node_guid; > > > > > uint8_t port_num, block_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > 
> > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From ogerlitz at voltaire.com Thu Feb 2 04:37:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 2 Feb 2006 14:37:35 +0200 (IST) Subject: [openib-general] [PATCH] iser: more cleanups Message-ID: removed two files plus more reorg and cleanups committed to r5262 Makefile | 4 - iscsi_iser.c | 81 ++++++++++++++++++++++++++++--- iscsi_iser.h | 23 +------- iser_conn.c | 2 iser_dto.c | 117 --------------------------------------------- iser_initiator.c | 128 +++++++++++++++++++++++++++++-------------------- iser_mod.c | 141 ------------------------------------------------------- iser_verbs.c | 2 8 files changed, 157 insertions(+), 341 deletions(-) removed iser_mod.c and iser_dto.c Signed-off-by: Or Gerlitz Index: ulp/iser/iser_conn.c =================================================================== --- ulp/iser/iser_conn.c (revision 5251) +++ ulp/iser/iser_conn.c (revision 5262) @@ -473,7 +473,7 @@ post_receive_control_exit: if(err && rx_desc) { - iser_dto_free(p_recv_dto); + iser_dto_buffs_release(p_recv_dto); if(rx_desc->data != NULL) kfree(rx_desc->data); kmem_cache_free(ig.desc_cache, rx_desc); Index: ulp/iser/iscsi_iser.h =================================================================== --- ulp/iser/iscsi_iser.h (revision 5251) +++ ulp/iser/iscsi_iser.h (revision 5262) @@ -501,25 +501,14 @@ #define USE_SIZE(size) (size) #define USE_ENTIRE_SIZE 0 -int iser_dto_add_regd_buff(struct iser_dto *p_dto, - struct iser_regd_buf *p_regd_buf, - unsigned long use_offset, - unsigned long use_size); +void iser_dto_add_regd_buff(struct iser_dto *p_dto, + struct iser_regd_buf *p_regd_buf, + unsigned long use_offset, + unsigned long use_size); -void iser_dto_free(struct iser_dto *p_dto); +void 
iser_dto_buffs_release(struct iser_dto *p_dto); -void iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, - struct iser_desc *tx_desc); - - /* iser_initiator.h */ -int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, - struct iser_data_buf *p_data, - enum iser_data_dir iser_dir, - enum dma_data_direction dma_dir); - -void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task); - void iser_rcv_completion(struct iser_desc *p_desc, unsigned long dto_xfer_len); @@ -620,8 +609,6 @@ int iser_connect(struct iser_conn *p_iser_conn, struct sockaddr_in *dst_addr, struct sockaddr_in *src_addr); -int iser_disconnect(struct iser_conn *p_iser_conn); - int iser_free_qp_and_id(struct iser_conn *p_iser_conn); int iser_free_ib_conn_res(struct iser_conn *p_iser_conn); Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 5251) +++ ulp/iser/iser_verbs.c (revision 5262) @@ -644,7 +644,7 @@ struct iser_dto *p_dto = &p_desc->dto; struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; - iser_dto_free(p_dto); + iser_dto_buffs_release(p_dto); if(p_desc->type == ISCSI_RX) { kfree(p_desc->data); Index: ulp/iser/iser_initiator.c =================================================================== --- ulp/iser/iser_initiator.c (revision 5251) +++ ulp/iser/iser_initiator.c (revision 5262) @@ -40,10 +40,10 @@ #include "iscsi_iser.h" -int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, - struct iser_data_buf *p_data, - enum iser_data_dir iser_dir, - enum dma_data_direction dma_dir) +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *p_data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir) { struct device *dma_device; dma_addr_t dma_addr; @@ -74,7 +74,7 @@ return 0; } -void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task) +static void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task 
*p_iser_task) { struct device *dma_device; struct iser_data_buf *p_data; @@ -279,6 +279,30 @@ return 0; } + +/* creates a new tx descriptor and adds header regd buffer */ +static void iser_create_send_desc(struct iscsi_iser_conn *p_iser_conn, + struct iser_desc *tx_desc) +{ + struct iser_regd_buf *p_regd_hdr = &tx_desc->hdr_regd_buf; + struct iser_dto *p_send_dto = &tx_desc->dto; + + memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); + p_regd_hdr->p_adaptor = p_iser_conn->ib_conn->p_adaptor; + p_regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ + p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + p_send_dto->p_conn = p_iser_conn; + p_send_dto->notify_enable = 1; + p_send_dto->regd_vector_len = 0; + + memset(&tx_desc->iser_header, 0, ISER_HDR_LEN); + tx_desc->iser_header.flags = ISER_VER; + + iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, + USE_NO_OFFSET, USE_ENTIRE_SIZE); +} + static int iser_check_xmit(struct iscsi_iser_conn *conn, void *task) { @@ -325,7 +349,7 @@ p_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; p_send_dto = &p_ctask->desc.dto; p_send_dto->p_task = p_ctask; - iser_dto_send_create(p_iser_conn, &p_ctask->desc); + iser_create_send_desc(p_iser_conn, &p_ctask->desc); if (sc->use_sg) { /* using a scatter list */ data_buf.p_buf = sc->request_buffer; @@ -367,7 +391,7 @@ send_command_error: if (p_send_dto != NULL) { - iser_dto_free(p_send_dto); + iser_dto_buffs_release(p_send_dto); /* FIXME we need to dec the ref count */ } if (p_iser_conn != NULL) { @@ -420,7 +444,7 @@ /* build the tx desc regd header and add it to the tx desc dto */ p_send_dto = &tx_desc->dto; p_send_dto->p_task = p_ctask; - iser_dto_send_create(p_iser_conn, tx_desc); + iser_create_send_desc(p_iser_conn, tx_desc); iser_reg_single(p_iser_conn->ib_conn->p_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); @@ -449,7 +473,7 @@ send_data_out_error: if (p_send_dto != NULL) - iser_dto_free(p_send_dto); + iser_dto_buffs_release(p_send_dto); if (tx_desc != NULL) 
kmem_cache_free(ig.desc_cache, tx_desc); @@ -484,7 +508,7 @@ p_mtask->desc.type = ISCSI_TX_CONTROL; p_send_dto = &p_mtask->desc.dto; p_send_dto->p_task = NULL; - iser_dto_send_create(p_iser_conn, &p_mtask->desc); + iser_create_send_desc(p_iser_conn, &p_mtask->desc); p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; @@ -492,37 +516,19 @@ itt = ntohl(p_mtask->hdr->itt); opcode = p_mtask->hdr->opcode & ISCSI_OPCODE_MASK; + data_seg_len = ntoh24(p_mtask->hdr->dlength); - /* no need to copy when there's data b/c the mtask is not reallocated * - * till the response related to this ITT is received */ - switch (opcode) { - - case ISCSI_OP_SCSI_TMFUNC: - /* ToDo p_ctrl_pdu->data.task_mgt_req.buf_in */ - case ISCSI_OP_NOOP_OUT: - case ISCSI_OP_LOGIN: - case ISCSI_OP_TEXT: - case ISCSI_OP_LOGOUT: - data_seg_len = ntoh24(p_mtask->hdr->dlength); - if (data_seg_len > 0) { - p_regd_buf = &p_mtask->desc.data_regd_buf; - memset(p_regd_buf, 0, sizeof(struct iser_regd_buf)); - p_regd_buf->p_adaptor = p_iser_adaptor; - p_regd_buf->virt_addr = p_mtask->data; - p_regd_buf->data_size = p_mtask->data_count; - iser_reg_single(p_iser_adaptor, p_regd_buf, - DMA_TO_DEVICE); - iser_dto_add_regd_buff(p_send_dto, p_regd_buf, - USE_NO_OFFSET, - USE_SIZE(data_seg_len)); - } - break; - - default: - iser_err("Unsupported opcode = %d\n", opcode); - err = -EINVAL; - goto send_control_error; - break; + if (data_seg_len > 0) { + p_regd_buf = &p_mtask->desc.data_regd_buf; + memset(p_regd_buf, 0, sizeof(struct iser_regd_buf)); + p_regd_buf->p_adaptor = p_iser_adaptor; + p_regd_buf->virt_addr = p_mtask->data; + p_regd_buf->data_size = p_mtask->data_count; + iser_reg_single(p_iser_adaptor, p_regd_buf, + DMA_TO_DEVICE); + iser_dto_add_regd_buff(p_send_dto, p_regd_buf, + USE_NO_OFFSET, + USE_SIZE(data_seg_len)); } if (iser_post_receive_control(p_iser_conn) != 0) { @@ -537,7 +543,7 @@ send_control_error: if (p_send_dto != NULL) - iser_dto_free(p_send_dto); + iser_dto_buffs_release(p_send_dto); if (p_iser_conn 
!= NULL) { /* drop the conn, open tasks are deleted during shutdown */ iser_err("send ctrl failed, drop conn:0x%p\n", p_iser_conn); @@ -574,15 +580,6 @@ opcode = p_hdr->opcode & ISCSI_OPCODE_MASK; - /* FIXME - "task" handles for non cmds */ - /* - if (itt == ISCSI_INVALID_ITT || - (opcode != ISCSI_OP_SCSI_CMD_RSP && - opcode != ISCSI_OP_NOOP_IN && - opcode != ISCSI_OP_TMFUNC_RSP && - opcode != ISCSI_OP_LOGIN_RSP && - opcode != ISCSI_OP_TEXT_RSP && opcode != ISCSI_OP_LOGOUT_RSP)) - */ if (opcode == ISCSI_OP_SCSI_CMD_RSP) { p_session = p_iser_conn->session; itt = p_hdr->itt; @@ -610,7 +607,7 @@ if(rc) iscsi_iser_conn_failure(p_iser_conn, rc); - iser_dto_free(p_dto); + iser_dto_buffs_release(p_dto); kfree(p_rx_desc->data); kmem_cache_free(ig.desc_cache, p_rx_desc); @@ -628,7 +625,7 @@ iser_dbg("Initiator, Data sent p_dto=0x%p\n", p_dto); - iser_dto_free(p_dto); + iser_dto_buffs_release(p_dto); if(p_tx_desc->type == ISCSI_TX_DATAOUT) kmem_cache_free(ig.desc_cache, p_tx_desc); @@ -681,3 +678,30 @@ iser_dma_unmap_task_data(p_iser_task); } + +/* iser_dto_add_regd_buff - increments the reference count for * + * the registered buffer & adds it to the DTO object */ +void iser_dto_add_regd_buff(struct iser_dto *p_dto, + struct iser_regd_buf *p_regd_buf, + unsigned long use_offset, + unsigned long use_size) +{ + int add_idx; + + iser_regd_buff_ref(p_regd_buf); + + add_idx = p_dto->regd_vector_len; + p_dto->regd[add_idx] = p_regd_buf; + p_dto->used_sz[add_idx] = use_size; + p_dto->offset[add_idx] = use_offset; + + p_dto->regd_vector_len++; +} + +void iser_dto_buffs_release(struct iser_dto *p_dto) +{ + int i; + + for (i = 0; i < p_dto->regd_vector_len; i++) + iser_regd_buff_release(p_dto->regd[i]); +} Index: ulp/iser/Makefile =================================================================== --- ulp/iser/Makefile (revision 5251) +++ ulp/iser/Makefile (revision 5262) @@ -2,11 +2,9 @@ obj-$(CONFIG_INFINIBAND_ISER) += ib_iser.o -ib_iser-y := iser_mod.o \ - iser_verbs.o \ 
+ib_iser-y := iser_verbs.o \ iser_initiator.o \ iser_memory.o \ - iser_dto.o \ iser_socket.o \ iscsi_iser.o \ iser_conn.o \ Index: ulp/iser/iscsi_iser.c =================================================================== --- ulp/iser/iscsi_iser.c (revision 5251) +++ ulp/iser/iscsi_iser.c (revision 5262) @@ -4,6 +4,7 @@ * Copyright (C) 2004 Dmitry Yusupov * Copyright (C) 2004 Alex Aizman * Copyright (C) 2005 Mike Christie + * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. * maintained by openib-general at openib.org * * This program is free software; you can redistribute it and/or modify @@ -26,7 +27,6 @@ * Modified by: * Erez Zilber * - * Copyright (c) 2005 Voltaire, Inc. All rights reserved. * * $Id$ */ @@ -78,6 +78,21 @@ static unsigned int iscsi_max_lun = 512; module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); +#define DRV_VER "$Rev$" +#define DRV_DATE "$LastChangedDate$" + +int iser_debug_level = 0; + +MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " + "v" DRV_VER "(" DRV_DATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Alex Nezhinsky, Dan Bar Dov"); + +module_param_named(debug_level, iser_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level,"Enable debug tracing if > 0 (default:disabled)"); + +struct iser_global ig; + /** * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands * @@ -1783,26 +1798,76 @@ return rc; } -int iscsi_iser_init(void) +static int __init iser_init(void) { - int error; + int err; + iser_dbg( "Starting iSER datamover...\n"); + if (iscsi_max_lun < 1) { printk(KERN_ERR "Invalid max_lun value of %u\n", iscsi_max_lun); return -EINVAL; } + iscsi_iser_transport.max_lun = iscsi_max_lun; - error = iscsi_register_transport(&iscsi_iser_transport); - if (error) { - printk(KERN_ERR "iscsi_register_transport failed\n"); - return error; + memset(&ig, 0, sizeof(struct iser_global)); + + ig.desc_cache = kmem_cache_create("iser_descriptors", + sizeof (struct iser_desc), + 0, 
SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ig.desc_cache == NULL) + return -ENOMEM; + + /* adaptor init is called only after the first addr resolution */ + init_MUTEX(&ig.adaptor_list_sem); + INIT_LIST_HEAD(&ig.adaptor_list); + ig.num_adaptors = 0; + + err = iser_register_sockets(); + if (err) { + iser_err("iser socket init failed!\n"); + goto register_socket_failure; } + + err = iscsi_register_transport(&iscsi_iser_transport); + if (err) { + iser_err("iscsi_register_transport failed\n"); + goto register_transport_failure; + } + return 0; + +register_transport_failure: + iser_unreg_sockets(); +register_socket_failure: + kmem_cache_destroy(ig.desc_cache); + + return err; } -void iscsi_iser_exit(void) +static void __exit iser_exit(void) { + struct iser_adaptor *p_adaptor; + + iser_dbg( "Removing iSER datamover...\n"); + iscsi_unregister_transport(&iscsi_iser_transport); + + while(!list_empty(&ig.adaptor_list)) { + p_adaptor = list_entry(ig.adaptor_list.next, + struct iser_adaptor, ig_list); + list_del(&p_adaptor->ig_list); + iser_adaptor_release(p_adaptor); + kfree(p_adaptor); + ig.num_adaptors--; + } + + kmem_cache_destroy(ig.desc_cache); + + iser_unreg_sockets(); } +module_init(iser_init); +module_exit(iser_exit); Index: ulp/iser/iser_mod.c =================================================================== --- ulp/iser/iser_mod.c (revision 5251) +++ ulp/iser/iser_mod.c (revision 5262) @@ -1,141 +0,0 @@ -/* - * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - -#ifndef MODULE -#define MODULE -#endif - -#ifndef __KERNEL__ -#define __KERNEL__ -#endif - -#include -#include -#include -#include -#include -#include -#include -#include - -#include "iscsi_iser.h" -#include "iser_socket.h" - -#define DRV_VER "$Rev: 135 $" -#define DRV_DATE "$LastChangedDate: 2005-11-29 15:24:35 +0200 (Tue, 29 Nov 2005) $" - -int iser_debug_level = 0; - -MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " - "v" DRV_VER "(" DRV_DATE ")"); -MODULE_LICENSE("Dual BSD/GPL"); -MODULE_AUTHOR("Alex Nezhinsky, Dan Bar Dov"); - -module_param_named(debug_level, iser_debug_level, int, 0644); -MODULE_PARM_DESC(debug_level,"Enable debug tracing if > 0 (default:disabled)"); - -struct iser_global ig; - -static void iser_global_release(void); - -/** - * init_module - module initialization function - */ -int init_module(void) -{ - int err; - - iser_dbg( "Starting iSER datamover...\n"); - - memset(&ig, 0, sizeof(struct iser_global)); - - ig.desc_cache = kmem_cache_create("iser_descriptors", - sizeof (struct iser_desc), - 0, SLAB_HWCACHE_ALIGN, - NULL, NULL); - if (ig.desc_cache == NULL) - return -ENOMEM; - - /* adaptor init is called only after the first addr resolution */ - init_MUTEX(&ig.adaptor_list_sem); - INIT_LIST_HEAD(&ig.adaptor_list); - ig.num_adaptors = 0; - - err = iser_register_sockets(); - if (err) { - iser_err("iser socket init failed!\n"); - iser_global_release(); - return err; - } - - return iscsi_iser_init(); -} - -/** - * iser_global_release - Releases all resources - */ -static void iser_global_release(void) -{ - int err; - struct iser_adaptor *p_adaptor; - - iscsi_iser_exit(); - - while(!list_empty(&ig.adaptor_list)) { - p_adaptor = list_entry(ig.adaptor_list.next, - struct iser_adaptor, ig_list); - list_del(&p_adaptor->ig_list); - iser_adaptor_release(p_adaptor); - kfree(p_adaptor); - ig.num_adaptors--; - } - - if (ig.desc_cache != NULL) { - err = kmem_cache_destroy(ig.desc_cache); - if(err) - 
iser_err("kmem_cache_destory returned %d\n",err); - ig.desc_cache = NULL; - } - - iser_unreg_sockets(); -} - -/** - * cleanup_module - module cleanup function - */ -void cleanup_module(void) -{ - iser_dbg( "Removing iSER datamover...\n"); - iser_global_release(); -} Index: ulp/iser/iser_dto.c =================================================================== --- ulp/iser/iser_dto.c (revision 5251) +++ ulp/iser/iser_dto.c (revision 5262) @@ -1,117 +0,0 @@ -/* - * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - -#include -#include -#include -#include - -#include "iscsi_iser.h" - -/** - * iser_dto_add_regd_buff - Increments the reference count for the registered - * buffer & adds it to the DTO object - * - * returns index of used buffer - */ -int iser_dto_add_regd_buff(struct iser_dto *p_dto, - struct iser_regd_buf *p_regd_buf, - unsigned long use_offset, - unsigned long use_size) -{ - int add_idx; - - iser_regd_buff_ref(p_regd_buf); - - add_idx = p_dto->regd_vector_len; - p_dto->regd[add_idx] = p_regd_buf; - p_dto->used_sz[add_idx] = use_size; - p_dto->offset[add_idx] = use_offset; - - p_dto->regd_vector_len++; - - return add_idx; -} - -/** - * iser_dto_buffs_release - free all registered buffers - */ -void iser_dto_buffs_release(struct iser_dto *p_dto) -{ - int i; - - for (i = 0; i < p_dto->regd_vector_len; i++) { - iser_dbg("Releasing DTO:0x%p, regd.buf:0x%p, #%d\n", - p_dto, p_dto->regd[i], i); - iser_regd_buff_release(p_dto->regd[i]); - p_dto->regd[i] = NULL; - } -} - -/** - * iser_dto_free - Frees DTO descriptor and all associated buffers - */ -void iser_dto_free(struct iser_dto *p_dto) -{ - iser_dto_buffs_release(p_dto); -} - - -/** - * Creates a new send DTO descriptor, - * adds header regd buffer - * - */ -void iser_dto_send_create(struct iscsi_iser_conn *p_iser_conn, - struct iser_desc *tx_desc) -{ - struct iser_regd_buf *p_regd_hdr = &tx_desc->hdr_regd_buf; - struct iser_dto *p_send_dto = &tx_desc->dto; - - memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); - p_regd_hdr->p_adaptor = p_iser_conn->ib_conn->p_adaptor; - p_regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ - p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; - - p_send_dto->p_conn = p_iser_conn; - p_send_dto->notify_enable = 1; - p_send_dto->regd_vector_len = 0; - - memset(&tx_desc->iser_header, 0, ISER_HDR_LEN); - tx_desc->iser_header.flags = ISER_VER; - - iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, - USE_NO_OFFSET, USE_ENTIRE_SIZE); -} - From yael at mellanox.co.il 
Thu Feb 2 05:29:48 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 2 Feb 2006 15:29:48 +0200 Subject: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD7A@mtlexch01.mtl.com> Hi Hal, I don't think that we should have the asserts only on Linux. I do think that they are not essential, in the sense that Eitan described before - these asserts are relevant in debugging races that might occur anyways. I am currently checking Michael's approach on the Windows stack, and will update you on that when I'm done. Thanks, Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, February 02, 2006 1:52 PM To: Eitan Zahavi Cc: openib-general at openib.org Subject: RE: [openib-general] RE: [PATCH] Opensm - asserts before OSM_LOG_ENTER Hi Eitan, On Wed, 2006-02-01 at 15:49, Eitan Zahavi wrote: > Hi Hal, > > All of these asserts are checking objects that can only be missing if we > have memory rundown or races in OpenSM destruction. I do not think these > particular asserts are going to be useful anyway. > The damage in them is bigger than their usefulness. What damage ? > We could however have a LINUX_ASSERT which is active only in LINUX. > Can we have a macro that will be mapped to CL_ASSERT only if we are in > Linux? That just hides it down one level so is essentially the same solution in my mind as the one I proposed to conditionalize the CL_ASSERTs which are problematic for Windows compilers. It appears to me that there is a gcc predefine of either unix or linux that could be used for this. Also, shouldn't we take Michael's approach anyhow (using one of those gcc predefines) ? Does VC6 need to be supported on the Windows side ? -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Hal Rosenstock > > Sent: Wednesday, February 01, 2006 10:18 PM > > To: Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [openib-general] RE: [PATCH] Opensm - asserts before > OSM_LOG_ENTER > > > > Hi Eitan, > > > > On Wed, 2006-02-01 at 09:50, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > Please see below: > > > > > When trying to compile the windows stack with some late updates, > > > I've > > > > > encountered an issue with the addition/change of place of > asserts to > > > > > before the OSM_LOG_ENTER. Since OSM_LOG_ENTER declares a > variable, > > > > > then these asserts cause failure due to declaration in the > middle of > > > > > the function. > > > > > > > > The asserts are on a passed in pointer rather than the static > variable > > > > created by the MACRO based on the second parameter to > OSM_LOG_ENTER. I > > > > don't understand how this causes a problem. Is it Windows only? > > > > > [EZ] In C you are not allowed to mix variable declarations with > > > statements like "if" (on the same code block). In debug build the > > > CL_ASSERT includes an "if" statement that is later followed by > > > OSM_LOG_ENTER which declares the static variable. It used to fail > build > > > in Linux but for some reason it stopped. I guess if we used > -pedantic or > > > ANSI it would fail. > > > > OK. I understand now what is going on. > > > > > Anyway, the assert is on the passed down parameter > > > which is passed as NULL. This might happen only on race in the > "destroy" > > > flow - but if this is a race it is not guaranteed to catch the bug > as > > > the pointer might be free'ed after the assert. It should be caught > as a > > > "segfault" (dereferencing NULL pointer) or in valgrind. > > > > This is not always possible. > > > > > We have few options: > > > a. 
Do not use same code tree for WinIB - I do not think we want > that. > > > > I too would prefer not to do that but this is not a requirement of > > OpenIB. > > > > > b. Put everything after the CL_ASSERT in an internal code block > (i.e. > > > "{") - I do not think we want to do this either. > > > > I don't think that solves the problem as doesn't OSM_LOG_EXIT need > > access to that variable created by OSM_LOG_ENTER. > > > > > c. Move the CL_ASSERT before the function call (into the function > > > caller). > > > > Ugh... > > > > > d. Give up these few asserts as this only can happen as a race > during > > > resource destruction. > > > > e. What about some conditionalization of these asserts ? > > > > #ifndef __WIN__ > > CL_ASSERT(foo); > > #endif > > > > It's already in other places in OpenSM. > > > > > I think that in this case it is more important to keep the WinIB and > > > Linux tree identical. > > > > This is not a requirement although it is desirable. > > > > -- Hal > > > > > > > These asserts are all on the reciever object or the manager > object, > > > so > > > > > I don't think they are really necessary. > > > > > > > > They compile out when not using debug. I saw these trip at SC05. > > > [EZ] As explained - yes they can trip - but only if we have memory > > > pollution (that could be caught by valgrind) or during exit - when > and > > > they really a race and might not be caught by the assert. > > > > > > > > -- Hal > > > > > > > > > The Following patch removes these asserts. 
> > > > > > > > > > Thanks, > > > > > Yael > > > > > > > > > > Signed-off-by: Yael Kalka > > > > > > > > > > Index: opensm/osm_pkey_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_pkey_rcv.c (revision 5246) > > > > > +++ opensm/osm_pkey_rcv.c (working copy) > > > > > @@ -71,8 +71,6 @@ void > > > > > osm_pkey_rcv_destroy( > > > > > IN osm_pkey_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -125,8 +123,6 @@ osm_pkey_rcv_process( > > > > > uint8_t port_num; > > > > > uint16_t block_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sm_state_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_sm_state_mgr.c (revision 5246) > > > > > +++ opensm/osm_sm_state_mgr.c (working copy) > > > > > @@ -406,8 +406,6 @@ void > > > > > osm_sm_state_mgr_destroy( > > > > > IN osm_sm_state_mgr_t * const p_sm_mgr ) > > > > > { > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); > > > > > > > > > > cl_spinlock_destroy( &p_sm_mgr->state_lock ); > > > > > @@ -500,8 +498,6 @@ osm_sm_state_mgr_process( > > > > > { > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); > > > > > > > > > > /* > > > > > @@ -760,8 +756,6 @@ osm_sm_state_mgr_check_legality( > > > > > { > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > > > > > > - CL_ASSERT( p_sm_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_sm_mgr->p_log, > osm_sm_state_mgr_check_legality > > > ); > > > > > > > > > > /* > > > > > Index: 
opensm/osm_state_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_state_mgr.c (revision 5246) > > > > > +++ opensm/osm_state_mgr.c (working copy) > > > > > @@ -86,8 +86,6 @@ void > > > > > osm_state_mgr_destroy( > > > > > IN osm_state_mgr_t * const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); > > > > > > > > > > /* destroy the locks */ > > > > > @@ -1884,8 +1882,6 @@ osm_state_mgr_process( > > > > > ib_api_status_t status; > > > > > osm_remote_sm_t *p_remote_sm; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); > > > > > > > > > > /* if we are exiting do nothing */ > > > > > Index: opensm/osm_sa_guidinfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_guidinfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_guidinfo_record.c (working copy) > > > > > @@ -433,8 +433,6 @@ osm_gir_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_gir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_vlarb_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_vlarb_record.c (revision 5246) > > > > > +++ opensm/osm_sa_vlarb_record.c (working copy) > > > > > @@ -348,8 +348,6 @@ osm_vlarb_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); > > > > > > > > > > /* update the requestor physical port. 
*/ > > > > > Index: opensm/osm_sa_lft_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_lft_record.c (revision 5246) > > > > > +++ opensm/osm_sa_lft_record.c (working copy) > > > > > @@ -329,8 +329,6 @@ osm_lftr_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_portinfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_portinfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_portinfo_record.c (working copy) > > > > > @@ -600,8 +600,6 @@ osm_pir_rcv_process( > > > > > osm_physp_t* p_req_physp; > > > > > boolean_t trusted_req = TRUE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_req.c > > > > > > =================================================================== > > > > > --- opensm/osm_req.c (revision 5246) > > > > > +++ opensm/osm_req.c (working copy) > > > > > @@ -131,8 +131,6 @@ osm_req_get( > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > ib_net64_t tid; > > > > > > > > > > - CL_ASSERT( p_req ); > > > > > - > > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_get ); > > > > > > > > > > CL_ASSERT( p_path ); > > > > > @@ -222,8 +220,6 @@ osm_req_set( > > > > > ib_api_status_t status = IB_SUCCESS; > > > > > ib_net64_t tid; > > > > > > > > > > - CL_ASSERT( p_req ); > > > > > - > > > > > OSM_LOG_ENTER( p_req->p_log, osm_req_set ); > > > > > > > > > > CL_ASSERT( p_path ); > > > > > Index: opensm/osm_sa_pkey_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_pkey_record.c (revision 5246) > > > > > +++ 
opensm/osm_sa_pkey_record.c (working copy) > > > > > @@ -344,8 +344,6 @@ osm_pkey_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_lin_fwd_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_lin_fwd_rcv.c (revision 5246) > > > > > +++ opensm/osm_lin_fwd_rcv.c (working copy) > > > > > @@ -75,8 +75,6 @@ void > > > > > osm_lft_rcv_destroy( > > > > > IN osm_lft_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -121,8 +119,6 @@ osm_lft_rcv_process( > > > > > ib_net64_t node_guid; > > > > > ib_api_status_t status; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_slvl_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_slvl_record.c (revision 5246) > > > > > +++ opensm/osm_sa_slvl_record.c (working copy) > > > > > @@ -324,8 +324,6 @@ osm_slvl_rec_rcv_process( > > > > > ib_net64_t comp_mask; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sminfo_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_sminfo_rcv.c (revision 5246) > > > > > +++ opensm/osm_sminfo_rcv.c (working copy) > > > > > @@ -80,8 +80,6 @@ void > > > > > osm_sminfo_rcv_destroy( > > > > > IN osm_sminfo_rcv_t* const p_rcv ) > > > > > { > > > > > - 
CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > Index: opensm/osm_node_info_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_node_info_rcv.c (revision 5246) > > > > > +++ opensm/osm_node_info_rcv.c (working copy) > > > > > @@ -981,8 +981,6 @@ void > > > > > osm_ni_rcv_destroy( > > > > > IN osm_ni_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -1028,8 +1026,6 @@ osm_ni_rcv_process( > > > > > osm_node_t *p_node; > > > > > boolean_t process_new_flag = FALSE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_mcast_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_mcast_mgr.c (revision 5246) > > > > > +++ opensm/osm_mcast_mgr.c (working copy) > > > > > @@ -394,8 +394,6 @@ void > > > > > osm_mcast_mgr_destroy( > > > > > IN osm_mcast_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -449,8 +447,6 @@ __osm_mcast_mgr_set_tbl( > > > > > ib_net16_t block[IB_MCAST_BLOCK_SIZE]; > > > > > osm_signal_t signal = OSM_SIGNAL_DONE; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); > > > > > > > > > > CL_ASSERT( p_sw ); > > > > > Index: opensm/osm_sa_sminfo_record.c > > > > > > =================================================================== > > > > > --- opensm/osm_sa_sminfo_record.c (revision 5246) > > > > > +++ opensm/osm_sa_sminfo_record.c (working 
copy) > > > > > @@ -89,8 +89,6 @@ void > > > > > osm_smir_rcv_destroy( > > > > > IN osm_smir_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -142,8 +140,6 @@ osm_smir_rcv_process( > > > > > ib_net64_t local_guid; > > > > > osm_port_t* local_port; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_trap_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_trap_rcv.c (revision 5246) > > > > > +++ opensm/osm_trap_rcv.c (working copy) > > > > > @@ -189,8 +189,6 @@ void > > > > > osm_trap_rcv_destroy( > > > > > IN osm_trap_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); > > > > > > > > > > cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); > > > > > Index: opensm/osm_ucast_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_ucast_mgr.c (revision 5246) > > > > > +++ opensm/osm_ucast_mgr.c (working copy) > > > > > @@ -90,8 +90,6 @@ void > > > > > osm_ucast_mgr_destroy( > > > > > IN osm_ucast_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -785,8 +783,6 @@ __osm_ucast_mgr_set_table( > > > > > uint32_t block_id_ho = 0; > > > > > uint8_t block[IB_SMP_DATA_SIZE]; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); > > > > > > > > > > CL_ASSERT( p_sw ); > > > > > Index: opensm/osm_sa_node_record.c > > > > > > 
=================================================================== > > > > > --- opensm/osm_sa_node_record.c (revision 5246) > > > > > +++ opensm/osm_sa_node_record.c (working copy) > > > > > @@ -435,8 +435,6 @@ osm_nr_rcv_process( > > > > > ib_api_status_t status; > > > > > osm_physp_t* p_req_physp; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sw_info_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_sw_info_rcv.c (revision 5246) > > > > > +++ opensm/osm_sw_info_rcv.c (working copy) > > > > > @@ -363,8 +363,6 @@ __osm_si_rcv_process_new( > > > > > ib_smp_t *p_smp; > > > > > cl_qmap_t *p_sw_guid_tbl; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > @@ -582,8 +580,6 @@ void > > > > > osm_si_rcv_destroy( > > > > > IN osm_si_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -631,8 +627,6 @@ osm_si_rcv_process( > > > > > ib_net64_t node_guid; > > > > > osm_si_context_t *p_context; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_mcast_fwd_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_mcast_fwd_rcv.c (revision 5246) > > > > > +++ opensm/osm_mcast_fwd_rcv.c (working copy) > > > > > @@ -77,8 +77,6 @@ void > > > > > osm_mft_rcv_destroy( > > > > > IN osm_mft_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); > > > > > > > > > 
> OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -124,8 +122,6 @@ osm_mft_rcv_process( > > > > > ib_net64_t node_guid; > > > > > ib_api_status_t status; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_slvl_map_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_slvl_map_rcv.c (revision 5246) > > > > > +++ opensm/osm_slvl_map_rcv.c (working copy) > > > > > @@ -83,8 +83,6 @@ void > > > > > osm_slvl_rcv_destroy( > > > > > IN osm_slvl_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -136,8 +134,6 @@ osm_slvl_rcv_process( > > > > > ib_net64_t node_guid; > > > > > uint8_t out_port_num, in_port_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_node_desc_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_node_desc_rcv.c (revision 5246) > > > > > +++ opensm/osm_node_desc_rcv.c (working copy) > > > > > @@ -109,8 +109,6 @@ void > > > > > osm_nd_rcv_destroy( > > > > > IN osm_nd_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -152,8 +150,6 @@ osm_nd_rcv_process( > > > > > osm_node_t *p_node; > > > > > ib_net64_t node_guid; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_sa_mcmember_record.c > > > > > > 
=================================================================== > > > > > --- opensm/osm_sa_mcmember_record.c (revision 5246) > > > > > +++ opensm/osm_sa_mcmember_record.c (working copy) > > > > > @@ -109,8 +109,6 @@ void > > > > > osm_mcmr_rcv_destroy( > > > > > IN osm_mcmr_recv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); > > > > > > > > > > cl_qlock_pool_destroy( &p_rcv->pool ); > > > > > @@ -1967,8 +1965,6 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* > > > > > osm_physp_t* p_req_physp; > > > > > boolean_t trusted_req = TRUE; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > @@ -2173,8 +2169,6 @@ osm_mcmr_rcv_process( > > > > > ib_member_rec_t *p_recvd_mcmember_rec; > > > > > boolean_t valid; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > > > Index: opensm/osm_drop_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_drop_mgr.c (revision 5246) > > > > > +++ opensm/osm_drop_mgr.c (working copy) > > > > > @@ -81,8 +81,6 @@ void > > > > > osm_drop_mgr_destroy( > > > > > IN osm_drop_mgr_t* const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -597,8 +595,6 @@ osm_drop_mgr_process( > > > > > uint8_t port_num; > > > > > osm_physp_t *p_physp; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); > > > > > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > > > Index: opensm/osm_lid_mgr.c > > > > > > =================================================================== > > > > > 
--- opensm/osm_lid_mgr.c (revision 5246) > > > > > +++ opensm/osm_lid_mgr.c (working copy) > > > > > @@ -1312,8 +1312,6 @@ osm_lid_mgr_process_subnet( > > > > > osm_physp_t *p_physp; > > > > > int lid_changed; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); > > > > > > > > > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > > > > Index: opensm/osm_pkey_mgr.c > > > > > > =================================================================== > > > > > --- opensm/osm_pkey_mgr.c (revision 5246) > > > > > +++ opensm/osm_pkey_mgr.c (working copy) > > > > > @@ -73,8 +73,6 @@ void > > > > > osm_pkey_mgr_destroy( > > > > > IN osm_pkey_mgr_t * const p_mgr ) > > > > > { > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > > > @@ -238,8 +236,6 @@ osm_pkey_mgr_process( > > > > > osm_physp_t *p_physp; > > > > > osm_signal_t result = OSM_SIGNAL_DONE; > > > > > > > > > > - CL_ASSERT( p_mgr ); > > > > > - > > > > > OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); > > > > > > > > > > p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; > > > > > Index: opensm/osm_vl_arb_rcv.c > > > > > > =================================================================== > > > > > --- opensm/osm_vl_arb_rcv.c (revision 5246) > > > > > +++ opensm/osm_vl_arb_rcv.c (working copy) > > > > > @@ -83,8 +83,6 @@ void > > > > > osm_vla_rcv_destroy( > > > > > IN osm_vla_rcv_t* const p_rcv ) > > > > > { > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); > > > > > > > > > > OSM_LOG_EXIT( p_rcv->p_log ); > > > > > @@ -136,8 +134,6 @@ osm_vla_rcv_process( > > > > > ib_net64_t node_guid; > > > > > uint8_t port_num, block_num; > > > > > > > > > > - CL_ASSERT( p_rcv ); > > > > > - > > > > > OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process ); > > > > > > > > > > CL_ASSERT( p_madw ); > > > 
> > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Feb 2 05:31:09 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Feb 2006 15:31:09 +0200 Subject: [openib-general] Re: [PATCH] Opensm - using default dir In-Reply-To: <1138835625.15119.4890.camel@hal.voltaire.com> References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> <20060130124122.GY31887@mellanox.co.il> <20060201225923.GA32188@sashak.voltaire.com> <1138835625.15119.4890.camel@hal.voltaire.com> Message-ID: <20060202133109.GB32188@sashak.voltaire.com> On 18:13 Wed 01 Feb , Hal Rosenstock wrote: > On Wed, 2006-02-01 at 17:59, Sasha Khapyorsky wrote: > > On 14:41 Mon 30 Jan , Michael S. Tsirkin wrote: > > > Quoting r. Yael Kalka : > > > > =================================================================== > > > > --- include/opensm/osm_svn_revision.h (revision 5203) > > > > +++ include/opensm/osm_svn_revision.h (working copy) > > > > @@ -1 +1 @@ > > > > -#define OSM_SVN_REVISION "" > > > > +#define OSM_SVN_REVISION "5203M" > > > > > > This looks like a mistake. > > > And, I think this shows that keeping the generated file osm_svn_revision.h > > > represents a problem. > > > > Good point. Hal, could we svn-remove this file? > > Yes, this is possible but there is a little more work involved here as > the OSM_SVN_REVISION is checked for length 0 to determine whether to > print out the svn version message right now. It is ok. What I mean (and believe Michael too) is to not store osm_svn_revision.h under SVN, but generate in build time. 
Like this: --- a/src/userspace/management/osm/opensm/Makefile.am +++ b/src/userspace/management/osm/opensm/Makefile.am @@ -12,22 +12,20 @@ endif if OSMV_OPENIB .PHONY: always $(srcdir)/../include/opensm/osm_svn_revision.h: always - if \ - test '!' -d '$(srcdir)/.svn'; \ - then \ - echo Exported svn revision; \ + echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + if test '!' -d '$(srcdir)/.svn'; then \ + echo -n "" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ else \ - echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ svnversion -n $(srcdir)/.. >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - then \ - rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - else \ - mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - fi \ + fi ; \ + echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ + then \ + rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ + else \ + mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ + $(srcdir)/../include/opensm/osm_svn_revision.h ; \ fi endif Sasha. From mst at mellanox.co.il Thu Feb 2 08:37:58 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 2 Feb 2006 18:37:58 +0200 Subject: [openib-general] [PATCH] support setting node description Message-ID: <20060202163758.GF5673@mellanox.co.il> Roland, Hal, With node description trap standardization in progress, I think we are ready to commit the bits that make it possible to set node description to a meaningful value, which makes more sense than just returning whatever string hardware has burned on its flash. Here's a repost of Roland's patch with some fixes. OK to merge now? Signed-off-by: Michael S. Tsirkin ----------------- Updated to svn rev 4044 Make sure the whole 64 byte description is initialized before passing it to hardware. Signed-off-by: Michael S. Tsirkin This patch does a few things: - Adds node_guid and node_desc fields to struct ib_device - Has mthca set these fields on startup - Extends modify_device method to handle setting node_desc - Exposes node_desc in sysfs - Allows userspace to set node_desc by writing into sysfs file, e.g. echo -n `hostname` >> /sys/class/linux-kernel/drivers/infiniband/mthca0/node_desc This should probably be combined with Sean's work to get rid of node_guid queries in ULPs. Comments? - R. 
Index: linux-2.6.14/drivers/infiniband/core/sysfs.c =================================================================== --- linux-2.6.14/drivers/infiniband/core/sysfs.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/core/sysfs.c (working copy) @@ -637,14 +637,42 @@ be16_to_cpu(((__be16 *) &attr.node_guid)[3])); } +static ssize_t show_node_desc(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + return sprintf(buf, "%.64s\n", dev->node_desc); +} + +static ssize_t set_node_desc(struct class_device *cdev, const char *buf, + size_t count) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_modify desc = {}; + int ret; + + if (!dev->modify_device) + return -EIO; + + memcpy(desc.node_desc, buf, min_t(int, count, 64)); + ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc); + if (ret) + return ret; + + return count; +} + static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); +static CLASS_DEVICE_ATTR(node_desc, S_IRUGO | S_IWUSR, show_node_desc, + set_node_desc); static struct class_device_attribute *ib_class_attributes[] = { &class_device_attr_node_type, &class_device_attr_sys_image_guid, - &class_device_attr_node_guid + &class_device_attr_node_guid, + &class_device_attr_node_desc }; static struct class ib_class = { Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -177,6 +177,23 @@ return err; } +static int mthca_modify_device(struct ib_device *ibdev, + int mask, + struct ib_device_modify *props) +{ + if (mask & ~IB_DEVICE_MODIFY_NODE_DESC) + return 
-EOPNOTSUPP; + + if (mask & IB_DEVICE_MODIFY_NODE_DESC) { + if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) + return -ERESTARTSYS; + memcpy(ibdev->node_desc, props->node_desc, 64); + up(&to_mdev(ibdev)->cap_mask_mutex); + } + + return 0; +} + static int mthca_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, struct ib_port_modify *props) @@ -1071,6 +1088,20 @@ goto out; init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_DESC; + + err = mthca_MAD_IFC(dev, 1, 1, + 1, NULL, NULL, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(dev->ib_dev.node_desc, out_mad->data, 64); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(dev, 1, 1, @@ -1129,6 +1160,7 @@ dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_device = mthca_modify_device; dev->ib_dev.modify_port = mthca_modify_port; dev->ib_dev.query_pkey = mthca_query_pkey; dev->ib_dev.query_gid = mthca_query_gid; Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -106,6 +106,19 @@ } } +static void node_desc_override(struct ib_device *dev, + struct ib_mad *mad) +{ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP && + mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) { + down(&to_mdev(dev)->cap_mask_mutex); + memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64); + up(&to_mdev(dev)->cap_mask_mutex); + } +} + static void forward_trap(struct mthca_dev *dev, u8 port_num, struct ib_mad *mad) @@ -204,8 +217,10 @@ return IB_MAD_RESULT_FAILURE; } - if 
(!out_mad->mad_hdr.status) + if (!out_mad->mad_hdr.status) { smp_snoop(ibdev, port_num, in_mad); + node_desc_override(ibdev, out_mad); + } /* set return bit in status of directed route responses */ if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) Index: linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (revision 4044) +++ linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -231,11 +231,13 @@ }; enum ib_device_modify_flags { - IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 << 0, + IB_DEVICE_MODIFY_NODE_DESC = 1 << 1 }; struct ib_device_modify { u64 sys_image_guid; + char node_desc[64]; }; enum ib_port_modify_flags { @@ -959,6 +961,7 @@ u64 uverbs_cmd_mask; int uverbs_abi_ver; + char node_desc[64]; __be64 node_guid; u8 node_type; u8 phys_port_cnt; -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Thu Feb 2 09:08:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Feb 2006 09:08:27 -0800 Subject: [openib-general] on calling rdma_disconnect from non sleepablecontext In-Reply-To: <43E1A306.10002@voltaire.com> (Or Gerlitz's message of "Thu, 02 Feb 2006 08:13:26 +0200") References: <43E1A306.10002@voltaire.com> Message-ID: Or> Can you elaborate a little more on the "---might-- sleep" with Or> regard to the Mellanox hardware/firmware? empirically i saw Or> (and could not understand) that on 99% of the cases where my Or> code called ib_modify_qp (via rdma_disconnect) and Or> ib_destory_qp (directly) from non sleepable context (tasklet) Or> it just worked fine. Or> AFAIK since the mthca driver works in "events" command mode, Or> it would always sleep after issuing a command to the FW till Or> the command completion is reported to the commands EQ and then Or> there's a wakeup. It is somewhat strange that it worked for you. 
My only guess is that the command completed quickly enough that the
command code didn't actually hit the scheduler and try to switch tasks.

 - R.

From halr at voltaire.com Thu Feb 2 09:10:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Feb 2006 12:10:19 -0500
Subject: [openib-general] Re: [PATCH] support setting node description
In-Reply-To: <20060202163758.GF5673@mellanox.co.il>
References: <20060202163758.GF5673@mellanox.co.il>
Message-ID: <1138900197.15119.11040.camel@hal.voltaire.com>

Michael,

On Thu, 2006-02-02 at 11:37, Michael S. Tsirkin wrote:
> Roland, Hal,
> With node description trap standardization in progress, I think
> we are ready to commit the bits that make it possible to set node
> description to a meaningful value, which makes more sense
> than just returning whatever string hardware has burned on its flash.
>
> Here's a repost of Roland's patch with some fixes.
> OK to merge now?

I'm OK with this now, confident that there will be a standard mechanism
(other than polling) that the SM can use to detect this change, and a
subsequent patch for this if needed.

-- Hal

> Signed-off-by: Michael S. Tsirkin
>
> -----------------
>
>
> Updated to svn rev 4044
> Make sure the whole 64 byte description is initialized before passing it to
> hardware.
>
> Signed-off-by: Michael S. Tsirkin
>
> This patch does a few things:
> - Adds node_guid and node_desc fields to struct ib_device
> - Has mthca set these fields on startup
> - Extends modify_device method to handle setting node_desc
> - Exposes node_desc in sysfs
> - Allows userspace to set node_desc by writing into sysfs file, e.g.
> echo -n `hostname` >> /sys/class/linux-kernel/drivers/infiniband/mthca0/node_desc
>
> This should probably be combined with Sean's work to get rid of
> node_guid queries in ULPs.
>
> Comments?
>
> - R.
> > Index: linux-2.6.14/drivers/infiniband/core/sysfs.c > =================================================================== > --- linux-2.6.14/drivers/infiniband/core/sysfs.c (revision 4042) > +++ linux-2.6.14/drivers/infiniband/core/sysfs.c (working copy) > @@ -637,14 +637,42 @@ > be16_to_cpu(((__be16 *) &attr.node_guid)[3])); > } > > +static ssize_t show_node_desc(struct class_device *cdev, char *buf) > +{ > + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); > + > + return sprintf(buf, "%.64s\n", dev->node_desc); > +} > + > +static ssize_t set_node_desc(struct class_device *cdev, const char *buf, > + size_t count) > +{ > + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); > + struct ib_device_modify desc = {}; > + int ret; > + > + if (!dev->modify_device) > + return -EIO; > + > + memcpy(desc.node_desc, buf, min_t(int, count, 64)); > + ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc); > + if (ret) > + return ret; > + > + return count; > +} > + > static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); > static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); > static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); > +static CLASS_DEVICE_ATTR(node_desc, S_IRUGO | S_IWUSR, show_node_desc, > + set_node_desc); > > static struct class_device_attribute *ib_class_attributes[] = { > &class_device_attr_node_type, > &class_device_attr_sys_image_guid, > - &class_device_attr_node_guid > + &class_device_attr_node_guid, > + &class_device_attr_node_desc > }; > > static struct class ib_class = { > Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c > =================================================================== > --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (revision 4042) > +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) > @@ -177,6 +177,23 @@ > return err; > } > > +static int mthca_modify_device(struct 
ib_device *ibdev, > + int mask, > + struct ib_device_modify *props) > +{ > + if (mask & ~IB_DEVICE_MODIFY_NODE_DESC) > + return -EOPNOTSUPP; > + > + if (mask & IB_DEVICE_MODIFY_NODE_DESC) { > + if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) > + return -ERESTARTSYS; > + memcpy(ibdev->node_desc, props->node_desc, 64); > + up(&to_mdev(ibdev)->cap_mask_mutex); > + } > + > + return 0; > +} > + > static int mthca_modify_port(struct ib_device *ibdev, > u8 port, int port_modify_mask, > struct ib_port_modify *props) > @@ -1071,6 +1088,20 @@ > goto out; > > init_query_mad(in_mad); > + in_mad->attr_id = IB_SMP_ATTR_NODE_DESC; > + > + err = mthca_MAD_IFC(dev, 1, 1, > + 1, NULL, NULL, in_mad, out_mad, > + &status); > + if (err) > + goto out; > + if (status) { > + err = -EINVAL; > + goto out; > + } > + > + memcpy(dev->ib_dev.node_desc, out_mad->data, 64); > + > in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; > > err = mthca_MAD_IFC(dev, 1, 1, > @@ -1129,6 +1160,7 @@ > dev->ib_dev.class_dev.dev = &dev->pdev->dev; > dev->ib_dev.query_device = mthca_query_device; > dev->ib_dev.query_port = mthca_query_port; > + dev->ib_dev.modify_device = mthca_modify_device; > dev->ib_dev.modify_port = mthca_modify_port; > dev->ib_dev.query_pkey = mthca_query_pkey; > dev->ib_dev.query_gid = mthca_query_gid; > Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c > =================================================================== > --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (revision 4042) > +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (working copy) > @@ -106,6 +106,19 @@ > } > } > > +static void node_desc_override(struct ib_device *dev, > + struct ib_mad *mad) > +{ > + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || > + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && > + mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP && > + mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) { > + down(&to_mdev(dev)->cap_mask_mutex); > + 
memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64); > + up(&to_mdev(dev)->cap_mask_mutex); > + } > +} > + > static void forward_trap(struct mthca_dev *dev, > u8 port_num, > struct ib_mad *mad) > @@ -204,8 +217,10 @@ > return IB_MAD_RESULT_FAILURE; > } > > - if (!out_mad->mad_hdr.status) > + if (!out_mad->mad_hdr.status) { > smp_snoop(ibdev, port_num, in_mad); > + node_desc_override(ibdev, out_mad); > + } > > /* set return bit in status of directed route responses */ > if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > Index: linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h > =================================================================== > --- linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (revision 4044) > +++ linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (working copy) > @@ -231,11 +231,13 @@ > }; > > enum ib_device_modify_flags { > - IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 > + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 << 0, > + IB_DEVICE_MODIFY_NODE_DESC = 1 << 1 > }; > > struct ib_device_modify { > u64 sys_image_guid; > + char node_desc[64]; > }; > > enum ib_port_modify_flags { > @@ -959,6 +961,7 @@ > u64 uverbs_cmd_mask; > int uverbs_abi_ver; > > + char node_desc[64]; > __be64 node_guid; > u8 node_type; > u8 phys_port_cnt; From mshefty at ichips.intel.com Thu Feb 2 09:41:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Feb 2006 09:41:33 -0800 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43E1A9C2.5060609@voltaire.com> References: <43E1A9C2.5060609@voltaire.com> Message-ID: <43E2444D.7090300@ichips.intel.com> Or Gerlitz wrote: > I recall that SilverStorm mentioned they had a well working SA replica, > but when i look on the naked math for 1k nodes and the hard to reach in > real life uniform distribution of queries over time, i really can't see > how to reach it unless you practically never update the cache and you > have magically caused your SM/SA to survive the 
one and only session of > those 1k get table queries (again 350 mad/sec, so many concurrent rmpp > sessions etc etc). Whether the cache is there or not, the SA is going to see at least this many queries in order to establish all-to-all connections. This doesn't seem any worse to me than what a DNS server might see under the same conditions. > I guess feedback from MPI people telling whether they have plans to use > path query would help us to see where we actually stand. Hal and I will be meeting with them in Sonoma to make sure that we have their requirements down. But they have asked that each node have all path records as input into their routing algorithms. - Sean From rdreier at cisco.com Thu Feb 2 10:05:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Feb 2006 10:05:00 -0800 Subject: [openib-general] Re: [PATCH] support setting node description In-Reply-To: <20060202163758.GF5673@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 2 Feb 2006 18:37:58 +0200") References: <20060202163758.GF5673@mellanox.co.il> Message-ID: OK, seems like we have consensus here. I committed this (with down/up replaced by mutex_lock/mutex_unlock) and queued it for 2.6.17. - R. From rdreier at cisco.com Thu Feb 2 10:05:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Feb 2006 10:05:25 -0800 Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: <20060202065736.GC4216@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 2 Feb 2006 08:57:36 +0200") References: <20060202065736.GC4216@mellanox.co.il> Message-ID: Michael> So, how about we implement the all-neigh list work-around Michael> on trunk for now, probably inside #if version < 2.6.17? Makes sense. How serious is this issue? What do you have to do to trigger a problem? - R. 
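
A minimal sketch of the version gate suggested above ("inside #if version < 2.6.17"), written as plain userspace C so it can be compiled on its own. In a real kernel build the comparison would normally use LINUX_VERSION_CODE and KERNEL_VERSION() from the kernel headers. The SKETCH_* macros, the hard-coded example version, and neigh_cleanup_strategy() are hypothetical names used only for illustration; none of them come from the IPoIB code, and which kernel release actually carries the destructor change is an assumption here.

```c
/* Hypothetical stand-in for the kernel's version encoding: major,
 * minor and patch level packed into one comparable integer. */
#define SKETCH_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Assumed build target for this example. A real build would take the
 * value from the kernel headers instead of hard-coding it. */
#ifndef SKETCH_VERSION_CODE
#define SKETCH_VERSION_CODE SKETCH_VERSION(2, 6, 16)
#endif

/* Report which cleanup path this build compiles in. */
static const char *neigh_cleanup_strategy(void)
{
#if SKETCH_VERSION_CODE < SKETCH_VERSION(2, 6, 17)
	/* Older kernels: keep the all-neigh list work-around. */
	return "all-neigh list work-around";
#else
	/* Newer kernels: assumed to provide the neigh_params destructor. */
	return "neigh_params destructor";
#endif
}
```

The gate is resolved at compile time, so only one of the two paths ends up in the built module. That keeps the work-around out of builds against kernels that no longer need it.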
From rdreier at cisco.com Thu Feb 2 10:45:27 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 02 Feb 2006 10:45:27 -0800
Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the page size
In-Reply-To: (Or Gerlitz's message of "Tue, 17 Jan 2006 12:05:08 +0200 (IST)")
References:
Message-ID:

I applied this and queued it for 2.6.17, but I'm a little worried
about how much testing you have done with this. It looks safe enough
but your patch didn't even compile, due to the extra ";" in

> 	struct ib_fmr_attr attr = {
> 		.max_pages = params->max_pages_per_fmr,
> 		.max_maps = IB_FMR_MAX_REMAPS,
> -		.page_size = PAGE_SHIFT
> +		.page_shift = params->page_shift;

and also a bunch of missed page_size -> page_shift conversions in
mthca_mr.c.

 - R.

From mst at mellanox.co.il Thu Feb 2 10:57:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Feb 2006 20:57:13 +0200
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To:
References:
Message-ID: <20060202185713.GA6219@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
>
> Michael> So, how about we implement the all-neigh list work-around
> Michael> on trunk for now, probably inside #if version < 2.6.17?
>
> Makes sense. How serious is this issue?

As far as I remember, it's an oops, or data corruption, or a resource leak.

> What do you have to do to
> trigger a problem?

AFAIK flushing the ARP cache while doing traffic was what did it.

-- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Thu Feb 2 12:02:28 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Feb 2006 12:02:28 -0800 Subject: [openib-general] [PATCH 4/4] SA path record caching In-Reply-To: <43E1A596.3050707@voltaire.com> References: <43E1A596.3050707@voltaire.com> Message-ID: <43E26554.6000506@ichips.intel.com> Or Gerlitz wrote: > I think we really need to have in the very short term some solution at > hand which effectively allows for both the cma being used and the local > SA not replicating the SA. > > This is since the trunk is used in various environments, currently most > of them are not "all-to-all MPI doing path query" and the SA replica is > not mature/tested enough to prove that any SM/SA can live more or less > happily with it. > > Does it makes sense to you? can it be added with high priority? I've added the ability to disable the cache. The CMA will issue a path record query if an item is not found in the cache. To disable the cache, the module should be loaded with "cache_timeout=0". - Sean From sean.hefty at intel.com Thu Feb 2 14:34:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 2 Feb 2006 14:34:01 -0800 Subject: [openib-general] [PATCH] CMA / local SA: make local SA module optional for CMA operations Message-ID: Permit loading the CMA module without the local SA module. Symbols from the local SA are exported dynamically, and acquired by the CMA on CMA module load. This also provides an optimization to avoid checking the cache when it is disabled, but the module is loaded. This will become more important when multicast tracking is added to the local SA module. Signed-off-by: Sean Hefty --- I'm not sure if this is an appropriate way of doing this, but it worked in my testing. 
Index: core/local_sa.c =================================================================== --- core/local_sa.c (revision 5278) +++ core/local_sa.c (working copy) @@ -353,7 +353,6 @@ unlock: mutex_unlock(&lock); return ret; } -EXPORT_SYMBOL(ib_get_path_rec); static void sa_db_free_data(void *context, void *data) { @@ -446,6 +445,9 @@ static void sa_db_remove_one(struct ib_d static int __init sa_db_init(void) { cache_timeout = msecs_to_jiffies(cache_timeout); + if (cache_timeout) { + EXPORT_SYMBOL(ib_get_path_rec); + } hold_time = msecs_to_jiffies(hold_time); update_delay = msecs_to_jiffies(update_delay); return ib_register_client(&sa_db_client); Index: core/cma.c =================================================================== --- core/cma.c (revision 5278) +++ core/cma.c (working copy) @@ -63,6 +63,10 @@ static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); +static int (*cma_get_ib_path)(struct ib_device *device, u8 port_num, + union ib_gid *sgid, union ib_gid *dgid, + u16 pkey, struct ib_sa_path_rec *rec); + struct cma_device { struct list_head list; struct ib_device *device; @@ -1122,6 +1126,9 @@ static int cma_resolve_ib_route(struct r struct cma_work *work; int ret; + if (!cma_get_ib_path) + return cma_query_ib_route(id_priv, timeout_ms); + work = kzalloc(sizeof *work, GFP_KERNEL); if (!work) return -ENOMEM; @@ -1132,7 +1139,7 @@ static int cma_resolve_ib_route(struct r goto err1; } - ret = ib_get_path_rec(id_priv->id.device, id_priv->id.port_num, + ret = cma_get_ib_path(id_priv->id.device, id_priv->id.port_num, ib_addr_get_sgid(addr), ib_addr_get_dgid(addr), ib_addr_get_pkey(addr), route->path_rec); if (ret) @@ -1697,12 +1704,20 @@ static void cma_remove_one(struct ib_dev static int cma_init(void) { - return ib_register_client(&cma_client); + int ret; + + cma_get_ib_path = symbol_get(ib_get_path_rec); + ret = ib_register_client(&cma_client); + if (ret && cma_get_ib_path) + symbol_put(ib_get_path_rec); + return ret; } 
static void cma_cleanup(void) { ib_unregister_client(&cma_client); + if (cma_get_ib_path) + symbol_put(ib_get_path_rec); } module_init(cma_init); From sashak at voltaire.com Thu Feb 2 14:51:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 3 Feb 2006 00:51:43 +0200 Subject: [openib-general] [PATH] [TRIVIAL] opensm/osm_pkey_rcv_ctrl.c: make local function static Message-ID: <20060202225143.GH32188@sashak.voltaire.com> Hi, This makes local and "undercored" function static Signed-off-by: Sasha Khapyorsky diff --git a/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c b/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c --- a/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c +++ b/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c @@ -49,7 +49,7 @@ /********************************************************************** **********************************************************************/ -void +static void __osm_pkey_rcv_ctrl_disp_callback( IN void *context, IN void *p_data ) From sashak at voltaire.com Thu Feb 2 14:58:53 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 3 Feb 2006 00:58:53 +0200 Subject: [openib-general] [PATCH] osm/*/autogen.sh: check and create config dir Message-ID: <20060202225853.GJ32188@sashak.voltaire.com> Hi, This adds check and creation of 'config' directory in autogen.sh scripts. It is useful with stan-alone (non SVN) tree. 
Signed-off-by: Sasha Khapyorsky diff --git a/src/userspace/management/osm/complib/autogen.sh b/src/userspace/management/osm/complib/autogen.sh --- a/src/userspace/management/osm/complib/autogen.sh +++ b/src/userspace/management/osm/complib/autogen.sh @@ -3,6 +3,9 @@ # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} +# create config dir if not exist +test -d config || mkdir config + set -x (aclocal -I config -I ../config 2>&1 ) && \ (libtoolize --force --copy) && \ diff --git a/src/userspace/management/osm/include/autogen.sh b/src/userspace/management/osm/include/autogen.sh index 6e51328..03401b0 100755 --- a/src/userspace/management/osm/include/autogen.sh +++ b/src/userspace/management/osm/include/autogen.sh @@ -3,6 +3,9 @@ # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} +# create config dir if not exist +test -d config || mkdir config + set -x aclocal -I config libtoolize --force --copy diff --git a/src/userspace/management/osm/libvendor/autogen.sh b/src/userspace/management/osm/libvendor/autogen.sh index 8d06d45..d30bf8f 100755 --- a/src/userspace/management/osm/libvendor/autogen.sh +++ b/src/userspace/management/osm/libvendor/autogen.sh @@ -3,6 +3,9 @@ # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} +# create config dir if not exist +test -d config || mkdir config + set -x (aclocal -I config -I ../config 2>&1 ) && \ (libtoolize --force --copy) && \ diff --git a/src/userspace/management/osm/opensm/autogen.sh b/src/userspace/management/osm/opensm/autogen.sh index 8d06d45..d30bf8f 100755 --- a/src/userspace/management/osm/opensm/autogen.sh +++ b/src/userspace/management/osm/opensm/autogen.sh @@ -3,6 +3,9 @@ # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} +# create config dir if not exist +test -d config || mkdir config + set -x (aclocal -I config -I ../config 2>&1 ) && \ (libtoolize --force --copy) && \ diff 
--git a/src/userspace/management/osm/osmtest/autogen.sh b/src/userspace/management/osm/osmtest/autogen.sh index 8d06d45..d30bf8f 100755 --- a/src/userspace/management/osm/osmtest/autogen.sh +++ b/src/userspace/management/osm/osmtest/autogen.sh @@ -3,6 +3,9 @@ # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} +# create config dir if not exist +test -d config || mkdir config + set -x (aclocal -I config -I ../config 2>&1 ) && \ (libtoolize --force --copy) && \ From suri at baymicrosystems.com Thu Feb 2 15:00:24 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 2 Feb 2006 18:00:24 -0500 Subject: [openib-general] looping with class_device_attr store In-Reply-To: <52r7a94liq.fsf@cisco.com> Message-ID: <200602022300.k12N0TOo019781@mail.baymicrosystems.com> Hi: I created a class attr called portstate to toggle the port state values (INIT/ARM etc...), as follows: static CLASS_DEVICE_ATTR(portstate, S_IRUGO|S_IWUSR, show_stats, store_portstate); this results in a file under sys/class/Infiniband/mysw/. When I write to the file from linux shell as, echo "1" > portstate, my store function callback gets called forever.... Any ideas... Thanks a lot in advance! Suri From arlin.r.davis at intel.com Thu Feb 2 15:05:45 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 2 Feb 2006 15:05:45 -0800 Subject: [dat-discussions] RE: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal Message-ID: <59278FC0C48A994BABABD069571E45680DD55720@orsmsx401.amr.corp.intel.com> Here is an updated immediate data proposal based on the latest discussions. I am working on a patch. -arlin -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: DAT_immediate_data_rev2.pdf
Type: application/octet-stream
Size: 49071 bytes
Desc: DAT_immediate_data_rev2.pdf
URL:

From sean.hefty at intel.com Thu Feb 2 15:53:42 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 2 Feb 2006 15:53:42 -0800
Subject: [dat-discussions] RE: [openib-general] RE: [RFC] DAT 2.0 immediate data proposal
In-Reply-To: <59278FC0C48A994BABABD069571E45680DD55720@orsmsx401.amr.corp.intel.com>
Message-ID:

Here is an updated immediate data proposal based on the latest
discussions. I am working on a patch.

I don't see any app using this unless immediate data is supported, and
the data shows up in a completion. You have variables to indicate
support, the number of work requests immediate consumes, and how the
data is reported. What app is going to use this? If writes are not
supported, then the app will already have to deal with doing a write
followed by their own send. As a further complication, reserving the
first 4 bytes of any receive buffer for immediate data is only going to
cause alignment issues for the user.

This defines an API with different behavior based on the underlying
transport in ways that are visible to the application. Add a flag that
specifies if immediate data is supported. Define one way of doing that,
and move on. I fail to see any benefit complicating the API for a
transport that has to emulate transferring immediate data in an
application visible way.

- Sean

From rdreier at cisco.com Thu Feb 2 17:32:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 02 Feb 2006 17:32:30 -0800
Subject: [openib-general] [PATCH] CMA / local SA: make local SA module optional for CMA operations
In-Reply-To: (Sean Hefty's message of "Thu, 2 Feb 2006 14:34:01 -0800")
References:
Message-ID:

Sean> Permit loading the CMA module without the local SA module.
Sean> Symbols from the local SA are exported dynamically, and
Sean> acquired by the CMA on CMA module load.

This is a good goal but the implementation looks way too crazy. Can
we revamp the API to the SA so that this symbol chicanery isn't
required? Maybe have an SA core that the cache registers itself with
when it's loaded?

Especially this:

> +	if (cache_timeout) {
> +		EXPORT_SYMBOL(ib_get_path_rec);
> +	}

I'm sure that putting an EXPORT_SYMBOL inside an if() is not mergeable
upstream. Does it even do anything? As far as I can see from the
definition of EXPORT_SYMBOL in , the symbol will get
exported anyway.

symbol_get()/symbol_put() is a little suspicious too.

 - R.

From rdreier at cisco.com Thu Feb 2 17:42:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 02 Feb 2006 17:42:18 -0800
Subject: [openib-general] Re: ipoib_mcast_send.patch
In-Reply-To: <20060117192107.GA12456@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jan 2006 21:21:07 +0200")
References: <20060112213248.GH9256@mellanox.co.il> <20060117192107.GA12456@mellanox.co.il>
Message-ID:

I started looking at IPoIB patches again. In ipoib_mcast_send.patch,
we have:

> --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-23 21:24:10.000000000 +0200
> +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-23 21:25:19.000000000 +0200
> @@ -600,6 +600,10 @@ int ipoib_mcast_start_thread(struct net_
> 	queue_work(ipoib_workqueue, &priv->mcast_task);
> 	mutex_unlock(&mcast_mutex);
>
> +	spin_lock_irq(&priv->lock);
> +	set_bit(IPOIB_MCAST_STARTED, &priv->flags);
> +	spin_unlock_irq(&priv->lock);

This seems to leave a window where we set the IPOIB_MCAST_STARTED flag
but the multicast work hasn't run yet. Then it seems we're still
susceptible to the issue you described here:

> Further, there's an additional issue that I saw in testing:
> ipoib_mcast_send may get called when priv->broadcast is NULL
> (e.g.
if the device was downed and then upped internally because > of a port event). > If this happends and the sendonly join request gets completed before > priv->broadcast is set, we get an oops - R. From rdreier at cisco.com Thu Feb 2 20:31:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Feb 2006 20:31:59 -0800 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 30 Jan 2006 10:59:27 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> Message-ID: Here's a patch to mthca that adds handling for the en_sqd_async_notify flag in modify QP. I haven't tested it yet. Can you review it and run your test, and let me know how it works? If it looks good to you then I will commit it and queue it for 2.6.17. - R. --- infiniband/hw/mthca/mthca_cmd.c (revision 5287) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1638,7 +1638,8 @@ int mthca_MODIFY_QP(struct mthca_dev *de } } else - err = mthca_cmd(dev, mailbox->dma, (!!is_ee << 24) | num, + err = mthca_cmd(dev, mailbox->dma, + optmask | (!!is_ee << 24) | num, op_mod, op[trans], CMD_TIME_CLASS_C, status); if (my_mailbox) --- infiniband/hw/mthca/mthca_qp.c (revision 5287) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -575,6 +575,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct mthca_qp_param *qp_param; struct mthca_qp_context *qp_context; u32 req_param, opt_param; + u32 sqd_event = 0; u8 status; int err; @@ -839,8 +840,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->srqn); + if (cur_state == IB_QPS_RTS && new_state == IB_QPS_SQD && + attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY && + attr->en_sqd_async_notify) + sqd_event = 1 << 31; + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, - qp->qpn, 0, mailbox, 0, &status); + qp->qpn, 0, mailbox, sqd_event, &status); if 
(status) { mthca_warn(dev, "modify QP %d returned status %02x.\n", state_table[cur_state][new_state].trans, status);

From sean.hefty at intel.com Thu Feb 2 21:22:58 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 2 Feb 2006 21:22:58 -0800
Subject: [openib-general] [PATCH] CMA / local SA: make local SA module optional for CMA operations
In-Reply-To:
Message-ID:

>This is a good goal but the implementation looks way too crazy. Can
>we revamp the API to the SA so that this symbol chicanery isn't
>required? Maybe have an SA core that the cache registers itself with
>when it's loaded?

It's probably not even worth it long term. I was planning on using the
local SA module to track multicast records. The code ran without
requiring the local SA module to be loaded, which is what I was aiming
for.

>Especially this:
>
> > +	if (cache_timeout) {
> > +		EXPORT_SYMBOL(ib_get_path_rec);
> > +	}
>
>I'm sure that putting an EXPORT_SYMBOL inside an if() is not mergeable
>upstream. Does it even do anything? As far as I can see from the
>definition of EXPORT_SYMBOL in , the symbol will get
>exported anyway.

I had to include the { } around the if() to get this to compile, and
couldn't find anywhere in the code where EXPORT_SYMBOL was called inside
a function. I finally just gave up looking and tried it to see if it
would work. I'll admit that I was surprised that the build didn't
complain, but I don't know if it actually prevented that symbol from
being exported.

>symbol_get()/symbol_put() is a little suspicious too.

symbol_get()/symbol_put() are exported as GPL only, so I don't think
that we can use them anyway. I was just going to toss this entire patch
and let the cache_timeout parameter be the key to disable the cache for
now.
- Sean From halr at voltaire.com Fri Feb 3 03:53:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 06:53:30 -0500 Subject: [openib-general] Re: [PATH] [TRIVIAL] opensm/osm_pkey_rcv_ctrl.c: make local function static In-Reply-To: <20060202225143.GH32188@sashak.voltaire.com> References: <20060202225143.GH32188@sashak.voltaire.com> Message-ID: <1138967606.15119.16069.camel@hal.voltaire.com> On Thu, 2006-02-02 at 17:51, Sasha Khapyorsky wrote: > Hi, > > This makes local and "undercored" function static Thanks. Applied. -- Hal > Signed-off-by: Sasha Khapyorsky > > diff --git a/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c b/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c > --- a/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c > +++ b/src/userspace/management/osm/opensm/osm_pkey_rcv_ctrl.c > @@ -49,7 +49,7 @@ > > /********************************************************************** > **********************************************************************/ > -void > +static void > __osm_pkey_rcv_ctrl_disp_callback( > IN void *context, > IN void *p_data ) From halr at voltaire.com Fri Feb 3 04:03:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 07:03:06 -0500 Subject: [openib-general] Re: [PATH] [TRIVIAL] opensm/osm_port.h: comment correction In-Reply-To: <20060202225352.GI32188@sashak.voltaire.com> References: <20060202225352.GI32188@sashak.voltaire.com> Message-ID: <1138968177.15119.16125.camel@hal.voltaire.com> On Thu, 2006-02-02 at 17:53, Sasha Khapyorsky wrote: > Hi, > > This fixes outdated comment. Thanks. Applied. 
-- Hal > Signed-off-by: Sasha Khapyorsky > > diff --git a/src/userspace/management/osm/include/opensm/osm_port.h b/src/userspace/management/osm/include/opensm/osm_port.h > --- a/src/userspace/management/osm/include/opensm/osm_port.h > +++ b/src/userspace/management/osm/include/opensm/osm_port.h > @@ -1346,9 +1346,9 @@ osm_port_destroy( > * SEE ALSO > * Port, osm_port_init, osm_port_destroy, osm_port_is_inited > *********/ > -/****f* OpenSM: Port/osm_port_destroy > +/****f* OpenSM: Port/osm_port_delete > * NAME > -* osm_port_destroy > +* osm_port_delete > * > * DESCRIPTION > * This function destroys and deallocates a Port object. From halr at voltaire.com Fri Feb 3 04:23:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 07:23:03 -0500 Subject: [openib-general] Re: [PATCH] osm/*/autogen.sh: check and create config dir In-Reply-To: <20060202225853.GJ32188@sashak.voltaire.com> References: <20060202225853.GJ32188@sashak.voltaire.com> Message-ID: <1138969375.15119.16242.camel@hal.voltaire.com> On Thu, 2006-02-02 at 17:58, Sasha Khapyorsky wrote: > Hi, > > This adds check and creation of 'config' directory in autogen.sh scripts. > It is useful with stan-alone (non SVN) tree. Thanks. Applied. -- Hal > Signed-off-by: Sasha Khapyorsky From schihei at de.ibm.com Fri Feb 3 07:30:40 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Fri, 03 Feb 2006 16:30:40 +0100 Subject: [openib-general] Interchanged parameters for get_user_pages in uverbs_mem.c Message-ID: <43E37720.7040306@de.ibm.com> Hello Roland, is it possible that the "force" and write" parameter for the function get_user_pages are interchanged. 
The signature of get_user_pages is: int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int write, <<<<< int force, <<<<< struct page **pages, struct vm_area_struct **vmas) The usage in uverbs_mem.c in function ib_umem_get (line 110) is: ret = get_user_pages(current, current->mm, cur_base, min_t(int, npages, PAGE_SIZE / sizeof (struct page *)), 1, <<<< IS WRITE ??? !write, <<<< IS FORCE ??? page_list, NULL); If that is the case, I think the problem was not seen, because ib_umem_get was only used with the "write" flag at the moment. -- Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick ---------------------------------------------------------------------- Heiko J Schick I/O Firmware Development II Linux InfiniBand Device Drivers IBM Deutschland Entwicklung GmbH external: 49-07031-16-0 x4219 Schoenaicher Str. 220 t/l: 120-4129 71032 Boeblingen email: schickhj at de.ibm.com ---------------------------------------------------------------------- From halr at voltaire.com Fri Feb 3 05:54:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 08:54:56 -0500 Subject: [openib-general] Re: [PATCH] Opensm - using default dir In-Reply-To: <20060202133109.GB32188@sashak.voltaire.com> References: <5zwtgi0xs7.fsf@mtl066.yok.mtl.com> <20060130124122.GY31887@mellanox.co.il> <20060201225923.GA32188@sashak.voltaire.com> <1138835625.15119.4890.camel@hal.voltaire.com> <20060202133109.GB32188@sashak.voltaire.com> Message-ID: <1138974885.15119.16762.camel@hal.voltaire.com> On Thu, 2006-02-02 at 08:31, Sasha Khapyorsky wrote: > On 18:13 Wed 01 Feb , Hal Rosenstock wrote: > > On Wed, 2006-02-01 at 17:59, Sasha Khapyorsky wrote: > > > On 14:41 Mon 30 Jan , Michael S. Tsirkin wrote: > > > > Quoting r. 
Yael Kalka : > > > > > =================================================================== > > > > > --- include/opensm/osm_svn_revision.h (revision 5203) > > > > > +++ include/opensm/osm_svn_revision.h (working copy) > > > > > @@ -1 +1 @@ > > > > > -#define OSM_SVN_REVISION "" > > > > > +#define OSM_SVN_REVISION "5203M" > > > > > > > > This looks like a mistake. > > > > And, I think this shows that keeping the generated file osm_svn_revision.h > > > > represents a problem. > > > > > > Good point. Hal, could we svn-remove this file? > > > > Yes, this is possible but there is a little more work involved here as > > the OSM_SVN_REVISION is checked for length 0 to determine whether to > > print out the svn version message right now. > > It is ok. What I mean (and believe Michael too) is to not store > osm_svn_revision.h under SVN, but generate in build time. Like this: Got it. Thanks. Applied. -- Hal From mdidomenico at gmail.com Fri Feb 3 06:45:00 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Fri, 3 Feb 2006 09:45:00 -0500 Subject: [openib-general] svn trouble Message-ID: <97a7c7ed0602030645u5c1cfa27m8932bebe45b4b230@mail.gmail.com> I'm having difficulty downloading the openib source via svn from openib.org... Under cygwin, i do; svn co https://openib.org/svn/gen2 it starts downloading files till it gets to, the last file shown below, in which it case it just hangs and i have to kill it. Any thoughts? 
Thanks $ svn co https://openib.org/svn/gen2 A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1 A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/Jumpshots.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/chp4_servs.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpiman.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/cleanipcs.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpireconfig.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/tstmachines.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpif90.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/MPI.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/index.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpif77.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpirun.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpiCC.html A gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/mpicc.html From sean.hubbell at dbresearch.net Fri Feb 3 06:50:24 2006 From: sean.hubbell at dbresearch.net (Sean Hubbell) Date: Fri, 03 Feb 2006 08:50:24 -0600 Subject: [openib-general] relocation error / link time reference error Message-ID: <43E36DB0.7070801@dbresearch.net> Hello, I have just updated my kernel to 2.6.15 and updated openib as well (tools as well as the driver). When I run the tools like ibping I get the following error: ibping: relocation error: ibping: symbol argv0 version IBCOMMON_1.0 not defined in file libibcommon.so.1 with link time reference. Running ldd on ibping points to the latest ib libraries that I compiled this morning (the code itself is from 2 days ago). Has anyone see this as well and does anyone have an idea for a fix? 
Sean Hubbell From swise at opengridcomputing.com Fri Feb 3 07:14:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 03 Feb 2006 09:14:45 -0600 Subject: [openib-general] Re: [PATCH] CMA and iWARP Message-ID: <1138979685.8616.3.camel@stevo-desktop> Sean, I'll fix these and we'll re-release an up-to-date patch. Thanks, Steve. ---- referencing comments From Sean ---- Tom Tucker wrote: > +/* Handles an inbound connect request. The function creates a new > + * iw_cm_id to represent the new connection and inherits the client > + * callback function and other attributes from the listening parent. > + * > + * The work item contains a pointer to the listen_cm_id and the event. The > + * listen_cm_id contains the client cm_handler, context and device. These are > + * copied when the device is cloned. The event contains the new four tuple. Does the code take a reference on the listen_cm_id before scheduling the work item? > + */ > +static int cm_conn_req_handler(struct iwcm_work* work) > +{ > + struct iw_cm_id* cm_id; > + struct iwcm_id_private* cm_id_priv; > + int rc; > + > + /* If the status was not successful, ignore request */ > + if (work->event.status) { > + printk(KERN_ERR "%s:%d Bad status=%d for connection request ... 
" > + "should be filtered by provider\n", > + __FUNCTION__, __LINE__, > + work->event.status); > + return work->event.status; > + } > + cm_id = iw_create_cm_id(work->cm_id->id.device, work->cm_id->id.cm_handler, > + work->cm_id->id.context); > + if (IS_ERR(cm_id)) > + return PTR_ERR(cm_id); > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.provider_id = work->event.provider_id; > + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; > + > + /* Call the client CM handler */ > + rc = cm_id->cm_handler(cm_id, &work->event); > + if (rc) { > + cm_id->state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(cm_id); > + } > + kfree(work); > + return 0; > +} > + > +/* > + * Handles the transition to established state on the passive side. > + */ > +static int cm_conn_est_handler(struct iwcm_work* work) > +{ {snip} > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); A reference needs to be taken on the cm_id_priv before invoking the callback to block destruction. (I didn't see that a reference was released...) > +static int cm_conn_rep_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; {snip} > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } Same here - a reference is needed to block destruction before invoking the callback. 
> +static int cm_disconnect_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) > + iw_destroy_cm_id(&cm_id_priv->id); And here... > +static void cm_event_handler(struct iw_cm_id* cm_id, > + struct iw_cm_event* event) > +{ > + struct iwcm_work *work; > + struct iwcm_id_private* cm_id_priv; > + > + work = kmalloc(sizeof *work, GFP_ATOMIC); > + if (!work) > + return; > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + INIT_WORK(&work->work, cm_work_handler, work); > + work->cm_id = cm_id_priv; Reference the cm_id before queuing the work item. It needs to be released after processing any callbacks. - Sean From halr at voltaire.com Fri Feb 3 07:07:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 10:07:36 -0500 Subject: [openib-general] relocation error / link time reference error In-Reply-To: <43E36DB0.7070801@dbresearch.net> References: <43E36DB0.7070801@dbresearch.net> Message-ID: <1138979240.15119.17035.camel@hal.voltaire.com> Hi Sean, On Fri, 2006-02-03 at 09:50, Sean Hubbell wrote: > Hello, > > I have just updated my kernel to 2.6.15 and updated openib as well > (tools as well as the driver). When I run the tools like ibping I get > the following error: > > ibping: relocation error: ibping: symbol argv0 version IBCOMMON_1.0 not > defined in file libibcommon.so.1 with link time reference. > > Running ldd on ibping points to the latest ib libraries that I compiled > this morning (the code itself is from 2 days ago). > > Has anyone see this as well and does anyone have an idea for a fix? Can you update to the latest (r5292) and it should be fixed. If not, pick up the latest libibcommon.map and rebuild and reinstall libibcommon. Let me know whether this resolves your issue. Thanks. 
-- Hal > > Sean Hubbell > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Fri Feb 3 08:51:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Feb 2006 08:51:21 -0800 Subject: [openib-general] Re: Interchanged parameters for get_user_pages in uverbs_mem.c In-Reply-To: <43E37720.7040306@de.ibm.com> (Heiko J. Schick's message of "Fri, 03 Feb 2006 16:30:40 +0100") References: <43E37720.7040306@de.ibm.com> Message-ID: Heiko> Hello Roland, is it possible that the "force" and write" Heiko> parameter for the function get_user_pages are interchanged. No, the parameters are that way intentionally. The explanation is in the svn log: r2642 | roland | 2005-06-16 15:43:04 -0700 (Thu, 16 Jun 2005) | 7 lines Always ask get_user_pages() for writable pages, but pass force=1 if the consumer has only asked for read-only pages. This fixes a problem registering memory that has just been allocated but not touched yet, while allowing registration of read-only memory to continue to work. Signed-off-by: Roland Dreier - R. From rdreier at cisco.com Fri Feb 3 08:58:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Feb 2006 08:58:56 -0800 Subject: [openib-general] kernel 2.6.16-rc2 is out Message-ID: Linus just released 2.6.16-rc2. If you have a chance, please test the IB drivers in the stock kernel (ie don't replace drivers/infiniband with a svn tree) and report any issues you see. Also, please make sure that any important fixes that you would like to see in 2.6.16 are upstream -- if not, let me know so we can get them merged. 
Thanks, Roland From halr at voltaire.com Fri Feb 3 09:22:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 12:22:56 -0500 Subject: [openib-general] kernel 2.6.16-rc2 is out In-Reply-To: References: Message-ID: <1138987143.26011.29.camel@hal.voltaire.com> Hi Roland, On Fri, 2006-02-03 at 11:58, Roland Dreier wrote: > Linus just released 2.6.16-rc2. If you have a chance, please test the > IB drivers in the stock kernel (ie don't replace drivers/infiniband > with a svn tree) and report any issues you see. Also, please make > sure that any important fixes that you would like to see in 2.6.16 are > upstream -- if not, let me know so we can get them merged. Perhaps not critical for 2.6.16, but should be pushed upstream are the following SMI changes: r5045 | halr | 2006-01-17 11:44:14 -0500 (Tue, 17 Jan 2006) | 6 lines Changed paths: M /gen2/trunk/src/linux-kernel/infiniband/core/mad.c Simplified patch to properly handle directed route SMP with a beginning or ending LID routed part. Signed-off-by: Ralph Campbell Signed-off-by: Hal Rosenstock ------------------------------------------------------------------------ r4984 | halr | 2006-01-14 13:08:55 -0500 (Sat, 14 Jan 2006) | 15 lines Changed paths: M /gen2/trunk/src/linux-kernel/infiniband/core/agent.c M /gen2/trunk/src/linux-kernel/infiniband/core/mad.c M /gen2/trunk/src/linux-kernel/infiniband/core/smi.h Further simplification of SMI by eliminating smi_check_local_dr_smp The call to ib_get_agent_port() shouldn't be possible to fail when smi_check_local_dr_smp() is called from ib_mad_recv_done_handler(). When it is called from handle_outgoing_dr_smp(), the device and port_num come from mad_agent_priv so I assume the call to ib_get_agent_port() shouldn't fail either. In either case, smi_check_local_smp() only uses the mad_agent pointer to check that mad_agent->device->process_mad is not NULL. 
The device pointer would have to be the same as the one passed to smi_check_local_dr_smp() since that pointer is used later instead of the one checked in smi_check_local_smp(). Patch supplied by Ralph Campbell Signed-off-by: Hal Rosenstock ------------------------------------------------------------------------ r4983 | halr | 2006-01-14 12:42:08 -0500 (Sat, 14 Jan 2006) | 10 lines Changed paths: M /gen2/trunk/src/linux-kernel/infiniband/core/agent.c Remove redundant check from agent.c::smi_check_local_dr_smp smi_check_local_dr_smp() is called only from two places in core/mad.c It returns 0 or 1. In smi_check_local_dr_smp(), it checks for a directed route SMP but this function is only called when the SMP is a directed route so this is a NOP. Patch supplied by Ralph Campbell Signed-off-by: Hal Rosenstock Thanks. -- Hal > Thanks, > Roland From sean.hubbell at dbresearch.net Fri Feb 3 10:58:43 2006 From: sean.hubbell at dbresearch.net (Sean Hubbell) Date: Fri, 03 Feb 2006 12:58:43 -0600 Subject: [openib-general] relocation error / link time reference error In-Reply-To: <1138979240.15119.17035.camel@hal.voltaire.com> References: <43E36DB0.7070801@dbresearch.net> <1138979240.15119.17035.camel@hal.voltaire.com> Message-ID: <43E3A7E3.4020307@dbresearch.net> Hal, I downloaded r5292 and tried to build on an x86_64 Dell Power Edge 2850. I got some errors with respect to opensm/osm_svn_revision.h No such file or directory. I also received the OSM_SVN_REVISION undeclared; I supposed this would be in the osm_svn_revision.h file. How is this file supposed to be generated (with svnversion?)? 
Sean Hal Rosenstock wrote: >Hi Sean, > >On Fri, 2006-02-03 at 09:50, Sean Hubbell wrote: > > >>Hello, >> >> I have just updated my kernel to 2.6.15 and updated openib as well >>(tools as well as the driver). When I run the tools like ibping I get >>the following error: >> >>ibping: relocation error: ibping: symbol argv0 version IBCOMMON_1.0 not >>defined in file libibcommon.so.1 with link time reference. >> >>Running ldd on ibping points to the latest ib libraries that I compiled >>this morning (the code itself is from 2 days ago). >> >>Has anyone see this as well and does anyone have an idea for a fix? >> >> > >Can you update to the latest (r5292) and it should be fixed. If not, >pick up the latest libibcommon.map and rebuild and reinstall >libibcommon. > >Let me know whether this resolves your issue. Thanks. > >-- Hal > > > >>Sean Hubbell From halr at voltaire.com Fri Feb 3 11:24:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Feb 2006 14:24:49 -0500 Subject: [openib-general] relocation error / link time reference error In-Reply-To: <43E3A7E3.4020307@dbresearch.net> References: <43E36DB0.7070801@dbresearch.net> <1138979240.15119.17035.camel@hal.voltaire.com> <43E3A7E3.4020307@dbresearch.net> Message-ID: <1138994689.26011.82.camel@hal.voltaire.com> Hi Sean, On Fri, 2006-02-03 at 13:58, Sean Hubbell wrote: > Hal, > > I downloaded r5292 and tried to build on an x86_64 Dell Power Edge > 2850. 
I got some errors with respect to opensm/osm_svn_revision.h No > such file or directory. osm/include/opensm/osm_svn_revision.h ? I made a change for this this AM to not have it in the tree as it is autogenerated during the OpenSM make. > I also received the OSM_SVN_REVISION undeclared; > I supposed this would be in the osm_svn_revision.h file. Yes. > How is this file supposed to be generated (with svnversion?)? It is created when OpenSM is built. How did you build ? -- Hal > Sean > > Hal Rosenstock wrote: > > >Hi Sean, > > > >On Fri, 2006-02-03 at 09:50, Sean Hubbell wrote: > > > > > >>Hello, > >> > >> I have just updated my kernel to 2.6.15 and updated openib as well > >>(tools as well as the driver). When I run the tools like ibping I get > >>the following error: > >> > >>ibping: relocation error: ibping: symbol argv0 version IBCOMMON_1.0 not > >>defined in file libibcommon.so.1 with link time reference. > >> > >>Running ldd on ibping points to the latest ib libraries that I compiled > >>this morning (the code itself is from 2 days ago). > >> > >>Has anyone see this as well and does anyone have an idea for a fix? > >> > >> > > > >Can you update to the latest (r5292) and it should be fixed. If not, > >pick up the latest libibcommon.map and rebuild and reinstall > >libibcommon. > > > >Let me know whether this resolves your issue. Thanks. 
> > > >-- Hal > > > > > > > >>Sean Hubbell From grave at ipno.in2p3.fr Fri Feb 3 12:34:57 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Fri, 03 Feb 2006 21:34:57 +0100 Subject: [openib-general] Ada binding to libibverb well advanced but... Message-ID: <1138998897.15501.4.camel@ipnnarval> Hi all, I have bound all the functions and structures needed to translate rc_pingpong.c to Ada. But I try to understand the code as I translate it. Here comes the trouble for me :) When it comes to the big while loop with send and receive counters it's too difficult for me. Can somebody send me a very simple example with only one send from the client side to one receive on the server side ? Thanks in advance, xavier From Arkady.Kanevsky at netapp.com Fri Feb 3 12:38:15 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 3 Feb 2006 15:38:15 -0500 Subject: [openib-general] DAPL BOF at Sonoma Workshop Message-ID: There is no DAT presentation at Data Centric Workshop at Sonoma this time. Instead we will have a DAT BOF. I will participate remotely and James Lentini, a DAPL maintainer at OpenIB, will run it. The bridge info is 888-827-8686 conf. id 1068642. James will send email on the time of the BOF. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 From mshefty at ichips.intel.com Fri Feb 3 14:25:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 03 Feb 2006 14:25:10 -0800 Subject: [openib-general] Ada binding to libibverb well advanced but... In-Reply-To: <1138998897.15501.4.camel@ipnnarval> References: <1138998897.15501.4.camel@ipnnarval> Message-ID: <43E3D846.4020401@ichips.intel.com> Xavier Grave wrote: > I have bound all the functions and structures needed to translate > rc_pingpong.c to Ada. But I try to understand the code as I translate it. > Here comes the trouble for me :) When it comes to the big while loop with > send and receive counters it's too difficult for me. > Can somebody send me a very simple example with only one send from the > client side to one receive on the server side ? You may want to look at the example code in librdmacm. The code there does multiple sends over multiple connections, but the code may be easier to follow. (Basically, just ignore the outer for loops.) - Sean From grave at ipno.in2p3.fr Fri Feb 3 14:31:55 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Fri, 03 Feb 2006 23:31:55 +0100 Subject: [openib-general] Ada binding to libibverb well advanced but... In-Reply-To: <43E3D846.4020401@ichips.intel.com> References: <1138998897.15501.4.camel@ipnnarval> <43E3D846.4020401@ichips.intel.com> Message-ID: <1139005915.15501.7.camel@ipnnarval> > You may want to look at the example code in librdmacm. The code there does > multiple sends over multiple connections, but the code may be easier to follow. > (Basically, just ignore the outer for loops.) Thanks a lot ! 
From rdreier at cisco.com Fri Feb 3 14:32:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Feb 2006 14:32:46 -0800 Subject: [openib-general] kernel 2.6.16-rc2 is out In-Reply-To: <1138987143.26011.29.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Feb 2006 12:22:56 -0500") References: <1138987143.26011.29.camel@hal.voltaire.com> Message-ID: Thanks, I queued the first patch (fixing DR SMPs with a LID routed part) for 2.6.16 and put the other two (which looked to be pure cleanups) in my 2.6.17 queue. - R. From rdreier at cisco.com Fri Feb 3 14:50:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Feb 2006 14:50:52 -0800 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? In-Reply-To: (Roland Dreier's message of "Thu, 02 Feb 2006 20:31:59 -0800") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3013DC66A@mtlexch01.mtl.com> Message-ID: I tested the previous patch, and I found that I forgot to add the chunk to allow the IB_QP_EN_SQD_ASYNC_NOTIFY attribute to the allowed mask for RTS->SQD transitions. With the patch below (already committed, queued for 2.6.17) a simple test of the SQ drained event works for me. - R. 
Index: infiniband/hw/mthca/mthca_cmd.c =================================================================== --- infiniband/hw/mthca/mthca_cmd.c (revision 5292) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1638,7 +1638,8 @@ int mthca_MODIFY_QP(struct mthca_dev *de } } else - err = mthca_cmd(dev, mailbox->dma, (!!is_ee << 24) | num, + err = mthca_cmd(dev, mailbox->dma, + optmask | (!!is_ee << 24) | num, op_mod, op[trans], CMD_TIME_CLASS_C, status); if (my_mailbox) Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 5292) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -413,6 +413,12 @@ static const struct { }, [IB_QPS_SQD] = { .trans = MTHCA_TRANS_RTS2SQD, + .opt_param = { + [UD] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [UC] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [RC] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [MLX] = IB_QP_EN_SQD_ASYNC_NOTIFY + } }, }, [IB_QPS_SQD] = { @@ -575,6 +581,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct mthca_qp_param *qp_param; struct mthca_qp_context *qp_context; u32 req_param, opt_param; + u32 sqd_event = 0; u8 status; int err; @@ -839,8 +846,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->srqn = cpu_to_be32(1 << 24 | to_msrq(ibqp->srq)->srqn); + if (cur_state == IB_QPS_RTS && new_state == IB_QPS_SQD && + attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY && + attr->en_sqd_async_notify) + sqd_event = 1 << 31; + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, - qp->qpn, 0, mailbox, 0, &status); + qp->qpn, 0, mailbox, sqd_event, &status); if (status) { mthca_warn(dev, "modify QP %d returned status %02x.\n", state_table[cur_state][new_state].trans, status); 
From ralphc at pathscale.com Fri Feb 3 16:08:02 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 03 Feb 2006 16:08:02 -0800 Subject: [openib-general] [PATCH] use set_current_state() in SDP Message-ID: <1139011682.475.16.camel@brick.internal.keyresearch.com> The set_current_state() macro should be used instead of setting the task state directly. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c (revision 5294) +++ src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -801,7 +801,7 @@ add_wait_queue_exclusive(&(conn->lock.waitq), &wait); for (;;) { - current->state = TASK_UNINTERRUPTIBLE; + set_current_state(TASK_UNINTERRUPTIBLE); spin_unlock_irqrestore(&(conn->lock.slock), f); schedule(); spin_lock_irqsave(&(conn->lock.slock), f); @@ -811,7 +811,7 @@ break; } - current->state = TASK_RUNNING; + set_current_state(TASK_RUNNING); remove_wait_queue(&(conn->lock.waitq), &wait); } -- Ralph Campbell From arlin.r.davis at intel.com Fri Feb 3 16:22:15 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Fri, 3 Feb 2006 16:22:15 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal Message-ID: <59278FC0C48A994BABABD069571E45680DD8B9F2@orsmsx401.amr.corp.intel.com> All, During the DAT con-call today it was suggested that I draft the application requirements for immediate data. Here is my first cut; fire away! "Applications need an optimized mechanism to notify the receiving end that RDMA write data has completed beyond the two operation method currently used (RDMA write followed by message send). 
This new RDMA write feature will support 4-bytes of inline data that will be sent immediately after the RDMA write operation is complete. It should avoid any latency penalties normally associated with a two operation method. The initiating side must expose a 4-byte immediate data parameter for the application to set the inline data. The receiving side must provide a mechanism to accept the 4-byte immediate data. On the receiving side, the write with immediate completion notification is indicated through a receive completion. It is the responsibility of the provider to identify to the application 4-byte immediate data from a normal 4-byte send message. The inline byte ordering is application specific." Hopefully, this will help us come to a consensus on the proper interface and delivery mechanism for immediate data. Thanks, -arlin ________________________________ From: Hefty, Sean Sent: Thursday, February 02, 2006 3:54 PM To: Davis, Arlin R; dat-discussions at yahoogroups.com Cc: Lentini, James; openib-general at openib.org Subject: RE: [dat-discussions] RE: [openib-general] RE: [RFC] DAT 2.0immediate data proposal Here is an updated immediate data proposal based on the latest discussions. I am working on a patch. I don't see any app using this unless immediate data is supported, and the data shows up in a completion. You have variables to indicate support, the number of work requests immediate consumes, and how the data is reported. What app is going to use this? If writes are not supported, then the app will already have to deal with doing a write followed by their own send. As a further complication, reserving the first 4 bytes of any receive buffer for immediate data is only going to cause alignment issues for the user. This defines an API with different behavior based on the underlying transport in ways that are visible to the application. Add a flag that specifies if immediate data is supported. Define one way of doing that, and move on. 
I fail to see any benefit complicating the API for a transport that has to emulate transferring immediate data in an application visible way. - Sean From mshefty at ichips.intel.com Fri Feb 3 16:30:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 03 Feb 2006 16:30:07 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal In-Reply-To: <59278FC0C48A994BABABD069571E45680DD8B9F2@orsmsx401.amr.corp.intel.com> References: <59278FC0C48A994BABABD069571E45680DD8B9F2@orsmsx401.amr.corp.intel.com> Message-ID: <43E3F58F.8060007@ichips.intel.com> Davis, Arlin R wrote: > “Applications need an optimized mechanism to notify the receiving end > that RDMA write data has completed beyond the two operation method > currently used (RDMA write followed by message send). This new RDMA > write feature will support 4-bytes of inline data that will be sent Is there any reason to restrict the size of the immediate data? Could you define the API such that the size is variable? I.e. the provider can simply give the immediate data size, with 0 indicating that it is not supported. > It should avoid > any latency penalties normally associated with a two operation method. I would state this as a requirement. A write followed by a send should be pushed to the application, since they may be able to provide additional optimizations (such as combining operations) beyond what a provider could. > The initiating side must expose a 4-byte immediate data parameter for > the application to set the inline data. The receiving side must provide > a mechanism to accept the 4-byte immediate data. On the receiving side, > the write with immediate completion notification is indicated through a > receive completion. It is the responsibility of the provider to identify > to the application 4-byte immediate data from a normal 4-byte send > message. 
The inline byte ordering is application specific.” Requirements look good to me. - Sean From ralphc at pathscale.com Fri Feb 3 17:05:53 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 03 Feb 2006 17:05:53 -0800 Subject: [openib-general] [PATCH] change Mellanox SDP workaround to a module parameter Message-ID: <1139015153.475.20.camel@brick.internal.keyresearch.com> This patch changes the hardwired MTU limit of 1024 in SDP into a module parameter so it can be disabled for HCAs without the RC performance problem. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/ulp/sdp/sdp_actv.c =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp_actv.c (revision 5294) +++ src/linux-kernel/infiniband/ulp/sdp/sdp_actv.c (working copy) @@ -35,6 +35,11 @@ #include "sdp_main.h" +static int sdp_path_mtu_max = IB_MTU_1024; +module_param(sdp_path_mtu_max, int, 0); +MODULE_PARM_DESC(sdp_path_mtu_max, "Maximum path MTU to use for SDP " + "(0=no max, 1=256, 2=512, 3=1024, 4=2048, 5=4096)"); + /* * Connection establishment functions */ @@ -443,14 +448,12 @@ * save message */ sdp_buff_q_put_tail(&conn->send_post, buff); -#if 1 + conn->path_mtu = path->mtu; /* * Mellanox performance bug workaround. */ - if (path->mtu > IB_MTU_1024) - path->mtu = IB_MTU_1024; -#endif - conn->path_mtu = path->mtu; + if (sdp_path_mtu_max && conn->path_mtu > sdp_path_mtu_max) + conn->path_mtu = sdp_path_mtu_max; /* * set QP/CM parameters. */ -- Ralph Campbell From rdreier at cisco.com Fri Feb 3 17:06:27 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Feb 2006 17:06:27 -0800 Subject: [openib-general] [PATCH] use set_current_state() in SDP In-Reply-To: <1139011682.475.16.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Fri, 03 Feb 2006 16:08:02 -0800") References: <1139011682.475.16.camel@brick.internal.keyresearch.com> Message-ID: I think both of these places can use __set_current_state(). - R. 
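For readers following the set_current_state() versus __set_current_state() suggestion above, here is a minimal user-space sketch, not kernel code. In the Linux kernel, set_current_state() stores the new task state and then issues a memory barrier, while __set_current_state() is the plain store without that barrier. The model below imitates only that difference using C11 atomics. The names model_task_state, model_set_state(), model_set_state_nobarrier() and model_get_state(), and the numeric state values, are assumptions made up for illustration; they are not kernel identifiers, and the sketch does not attempt to show why the barrier-free variant is sufficient in sdp_conn.c.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for task states; the values are illustrative only. */
enum model_state {
    MODEL_TASK_RUNNING = 0,
    MODEL_TASK_UNINTERRUPTIBLE = 2
};

/* Hypothetical stand-in for current->state. */
static _Atomic int model_task_state = MODEL_TASK_RUNNING;

/* Models __set_current_state(): a plain store with no ordering guarantee
 * relative to surrounding memory accesses. */
static void model_set_state_nobarrier(int state)
{
    atomic_store_explicit(&model_task_state, state, memory_order_relaxed);
}

/* Models set_current_state(): the same store, followed by a full fence so
 * that later loads (for example, of a wake-up condition) cannot be
 * reordered before the state change becomes visible. */
static void model_set_state(int state)
{
    atomic_store_explicit(&model_task_state, state, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Reads back the modelled task state. */
static int model_get_state(void)
{
    return atomic_load_explicit(&model_task_state, memory_order_relaxed);
}
```

Both setters leave the same value in model_task_state; the only difference is the ordering guarantee, which is why the choice between them depends on what other synchronization surrounds the call site.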
From ralphc at pathscale.com Fri Feb 3 17:51:17 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Fri, 03 Feb 2006 17:51:17 -0800 Subject: [openib-general] [PATCH] use set_current_state() in SDP In-Reply-To: References: <1139011682.475.16.camel@brick.internal.keyresearch.com> Message-ID: <1139017877.475.30.camel@brick.internal.keyresearch.com> On Fri, 2006-02-03 at 17:06 -0800, Roland Dreier wrote: > I think both of these places can use __set_current_state(). > > - R. Good point. Here is the updated patch. Signed-off-by: Ralph Campbell Index: src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c (revision 5294) +++ src/linux-kernel/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -801,7 +801,7 @@ add_wait_queue_exclusive(&(conn->lock.waitq), &wait); for (;;) { - current->state = TASK_UNINTERRUPTIBLE; + __set_current_state(TASK_UNINTERRUPTIBLE); spin_unlock_irqrestore(&(conn->lock.slock), f); schedule(); spin_lock_irqsave(&(conn->lock.slock), f); @@ -811,7 +811,7 @@ break; } - current->state = TASK_RUNNING; + __set_current_state(TASK_RUNNING); remove_wait_queue(&(conn->lock.waitq), &wait); } -- Ralph Campbell
From rolandd at cisco.com Sat Feb 4 08:33:57 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 04 Feb 2006 16:33:57 +0000 Subject: [openib-general] [git patch review 1/2] IB/mad: Handle DR SMPs with a LID routed part Message-ID: <1139070837111-02eec52639fd6aed@cisco.com> Fix handling of directed route SMPs with a beginning or ending LID routed part. Signed-off-by: Ralph Campbell Signed-off-by: Hal Rosenstock Signed-off-by: Roland Dreier --- drivers/infiniband/core/mad.c | 10 +++++++++- 1 files changed, 9 insertions(+), 1 deletions(-) 8cf3f04f45694db0699f608c0e3fb550c607cc88 diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index d393b50..c82f47a 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -665,7 +665,15 @@ static int handle_outgoing_dr_smp(struct struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + /* + * Directed route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * If we are at the start of the LID routed part, don't update the + * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. + */ + if ((ib_get_smp_direction(smp) ? 
smp->dr_dlid : smp->dr_slid) == + IB_LID_PERMISSIVE && + !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; -- 1.1.3 From rolandd at cisco.com Sat Feb 4 08:33:57 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 04 Feb 2006 16:33:57 +0000 Subject: [openib-general] [git patch review 2/2] IB: Don't doublefree pages from scatterlist In-Reply-To: <1139070837111-02eec52639fd6aed@cisco.com> Message-ID: <1139070837112-3fe13a3288c20f5c@cisco.com> On some architectures, mapping the scatterlist may coalesce entries: if that coalesced list is then used for freeing the pages afterwards, there's a danger that pages may be doubly freed (and others leaked). Fix Infiniband's __ib_umem_release by freeing from a separate array beyond the scatterlist: IB_UMEM_MAX_PAGE_CHUNK lowered to fit one page. Signed-off-by: Hugh Dickins Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs_mem.c | 22 ++++++++++++++++------ include/rdma/ib_verbs.h | 3 +-- 2 files changed, 17 insertions(+), 8 deletions(-) 46fc99a4a1429f843e3b6df8ed1f571944bef4e2 diff --git a/drivers/infiniband/core/uverbs_mem.c b/drivers/infiniband/core/uverbs_mem.c index 36a32c3..87a363e 100644 --- a/drivers/infiniband/core/uverbs_mem.c +++ b/drivers/infiniband/core/uverbs_mem.c @@ -49,15 +49,18 @@ struct ib_umem_account_work { static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) { struct ib_umem_chunk *chunk, *tmp; + struct page **sg_pages; int i; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { dma_unmap_sg(dev->dma_device, chunk->page_list, chunk->nents, DMA_BIDIRECTIONAL); + /* Scatterlist may have been coalesced: free saved pagelist */ + sg_pages = (struct page **) (chunk->page_list + chunk->nents); for (i = 0; i < chunk->nents; ++i) { if (umem->writable && dirty) - set_page_dirty_lock(chunk->page_list[i].page); - put_page(chunk->page_list[i].page); + 
set_page_dirty_lock(sg_pages[i]); + put_page(sg_pages[i]); } kfree(chunk); @@ -69,11 +72,13 @@ int ib_umem_get(struct ib_device *dev, s { struct page **page_list; struct ib_umem_chunk *chunk; + struct page **sg_pages; unsigned long locked; unsigned long lock_limit; unsigned long cur_base; unsigned long npages; int ret = 0; + int nents; int off; int i; @@ -121,16 +126,21 @@ int ib_umem_get(struct ib_device *dev, s off = 0; while (ret) { - chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) * - min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK), + nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); + chunk = kmalloc(sizeof *chunk + + sizeof (struct scatterlist) * nents + + sizeof (struct page *) * nents, GFP_KERNEL); if (!chunk) { ret = -ENOMEM; goto out; } + /* Save pages to be freed in array beyond scatterlist */ + sg_pages = (struct page **) (chunk->page_list + nents); - chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); + chunk->nents = nents; for (i = 0; i < chunk->nents; ++i) { + sg_pages[i] = page_list[i + off]; chunk->page_list[i].page = page_list[i + off]; chunk->page_list[i].offset = 0; chunk->page_list[i].length = PAGE_SIZE; @@ -142,7 +152,7 @@ int ib_umem_get(struct ib_device *dev, s DMA_BIDIRECTIONAL); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) - put_page(chunk->page_list[i].page); + put_page(sg_pages[i]); kfree(chunk); ret = -ENOMEM; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 22fc886..239c11d 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -696,8 +696,7 @@ struct ib_udata { #define IB_UMEM_MAX_PAGE_CHUNK \ ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \ - ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ - (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) + (sizeof (struct scatterlist) + sizeof (struct page *))) struct ib_umem_object { struct ib_uobject uobject; -- 1.1.3 From sashak at voltaire.com Sat Feb 4 12:19:38 2006 From: sashak at voltaire.com (Sasha 
Khapyorsky) Date: Sat, 4 Feb 2006 22:19:38 +0200 Subject: [openib-general] relocation error / link time reference error Message-ID: Hi, -----Original Message----- From: openib-general-bounces at openib.org on behalf of Hal Rosenstock > osm/include/opensm/osm_svn_revision.h ? I made a change for this this AM > to not have it in the tree as it is autogenerated during the OpenSM > make. Yes, this works with incremental build because osm_svn_revision.h is already in dependency list (in one of .deps/ files), but fails with fresh build. Sean, please try attached patch - hope it solves the problem. Sasha. -------------- next part -------------- A non-text attachment was scrubbed... Name: opemsm-Makefile-am.patch Type: text/x-patch Size: 446 bytes Desc: opemsm-Makefile-am.patch URL: 
From dotanb at mellanox.co.il Sat Feb 4 23:47:54 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 5 Feb 2006 09:47:54 +0200 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3014E62CA@mtlexch01.mtl.com> Thanks, this patch works for me too. I have one comments: in the IB spec it is written: "Enable or disable Send Queue Drained, Asynchronous Affiliated Event Notification. This modifier is only applicable when the next QP state chosen is SQD." I think that the following transitions should support this event: SQD->RTS, SQD->SQD. what do you think? Dotan > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Saturday, February 04, 2006 12:51 AM > To: Dotan Barak > Cc: openib-general at openib.org > Subject: Re: [openib-general] Re: does the mthca driver > support RTS->SQD > event request? > > > I tested the previous patch, and I found that I forgot to add the > chunk to allow the IB_QP_EN_SQD_ASYNC_NOTIFY attribute to the allowed > mask for RTS->SQD transitions. > > With the patch below (already committed, queued for 2.6.17) a simple > test of the SQ drained event works for me. > > - R. 
> > Index: infiniband/hw/mthca/mthca_cmd.c > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.c (revision 5292) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -1638,7 +1638,8 @@ int mthca_MODIFY_QP(struct mthca_dev *de > } > > } else > - err = mthca_cmd(dev, mailbox->dma, (!!is_ee << > 24) | num, > + err = mthca_cmd(dev, mailbox->dma, > + optmask | (!!is_ee << 24) | num, > op_mod, op[trans], > CMD_TIME_CLASS_C, status); > > if (my_mailbox) > Index: infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- infiniband/hw/mthca/mthca_qp.c (revision 5292) > +++ infiniband/hw/mthca/mthca_qp.c (working copy) > @@ -413,6 +413,12 @@ static const struct { > }, > [IB_QPS_SQD] = { > .trans = MTHCA_TRANS_RTS2SQD, > + .opt_param = { > + [UD] = IB_QP_EN_SQD_ASYNC_NOTIFY, > + [UC] = IB_QP_EN_SQD_ASYNC_NOTIFY, > + [RC] = IB_QP_EN_SQD_ASYNC_NOTIFY, > + [MLX] = IB_QP_EN_SQD_ASYNC_NOTIFY > + } > }, > }, > [IB_QPS_SQD] = { > @@ -575,6 +581,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, > struct mthca_qp_param *qp_param; > struct mthca_qp_context *qp_context; > u32 req_param, opt_param; > + u32 sqd_event = 0; > u8 status; > int err; > > @@ -839,8 +846,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, > qp_context->srqn = cpu_to_be32(1 << 24 | > > to_msrq(ibqp->srq)->srqn); > > + if (cur_state == IB_QPS_RTS && new_state == IB_QPS_SQD && > + attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY && > + attr->en_sqd_async_notify) > + sqd_event = 1 << 31; > + > err = mthca_MODIFY_QP(dev, > state_table[cur_state][new_state].trans, > - qp->qpn, 0, mailbox, 0, &status); > + qp->qpn, 0, mailbox, sqd_event, &status); > if (status) { > mthca_warn(dev, "modify QP %d returned status %02x.\n", > > state_table[cur_state][new_state].trans, status); > From mst at mellanox.co.il Sun Feb 5 00:46:33 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 5 Feb 2006 10:46:33 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060205084633.GM5673@mellanox.co.il> Hi! Quoting r. Roland Dreier : > Subject: Re: ipoib_mcast_send.patch > > I started looking at IPoIB patches again. In ipoib_mcast_send.patch, > we have: > > > --- linux-2.6.15.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-23 21:24:10.000000000 +0200 > > +++ linux-2.6.15/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-23 21:25:19.000000000 +0200 > > @@ -600,6 +600,10 @@ int ipoib_mcast_start_thread(struct net_ > > queue_work(ipoib_workqueue, &priv->mcast_task); > > mutex_unlock(&mcast_mutex); > > > > + spin_lock_irq(&priv->lock); > > + set_bit(IPOIB_MCAST_STARTED, &priv->flags); > > + spin_unlock_irq(&priv->lock); > > This seems to leave a window where we set the IPOIB_MCAST_STARTED flag > but the multicast work hasn't run yet. Then it seems we're still > susceptible to the issue you described here: > > > Further, there's an additional issue that I saw in testing: > > ipoib_mcast_send may get called when priv->broadcast is NULL > > (e.g. if the device was downed and then upped internally because > > of a port event). > > If this happends and the sendonly join request gets completed before > > priv->broadcast is set, we get an oops Maybe the description is not clear enough. There are two issues here with two separate fixes. 1. IPOIB_MCAST_STARTED - solves the first issue 2. Checking priv->broadcast in ipoib_mcast_send here: + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) { - solves the second issue They just got rolled into one patch because they touch the same code lines. Do you want me to split them up? -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies
From yael at mellanox.co.il Sun Feb 5 02:40:45 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 05 Feb 2006 12:40:45 +0200 Subject: [openib-general] [PATCH] Opensm - build fails due to osm_svn_revision.h missing Message-ID: <5zslqy142a.fsf@mtl066.yok.mtl.com> Hi Hal, Currently the build of opensm fails, since osm_svn_revision.h is missing (and created only later). The following patch issued by Michael Tsirkin fixes it. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 5303) +++ opensm/Makefile.am (working copy) @@ -10,6 +10,7 @@ DBGFLAGS = -g endif if OSMV_OPENIB +BUILT_SOURCES = $(srcdir)/../include/opensm/osm_svn_revision.h .PHONY: always $(srcdir)/../include/opensm/osm_svn_revision.h: always echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ From mst at mellanox.co.il Sun Feb 5 03:05:42 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 5 Feb 2006 13:05:42 +0200 Subject: [openib-general] Re: [PATCH] Opensm - build fails due toosm_svn_revision.h missing In-Reply-To: <5zslqy142a.fsf@mtl066.yok.mtl.com> References: <5zslqy142a.fsf@mtl066.yok.mtl.com> Message-ID: <20060205110542.GT5673@mellanox.co.il> Quoting r. Yael Kalka : > Subject: [PATCH] Opensm - build fails due toosm_svn_revision.h missing > > > Hi Hal, > > Currently the build of opensm fails, since osm_svn_revision.h is > missing (and created only later). > The following patch issued by Michael Tsirkin fixes it. > > Thanks, > Yael > > Signed-off-by: Yael Kalka Yael, the patch I forwarded to you was actually issued by Sasha Khapyorsky. http://openib.org/pipermail/openib-general/2006-February/016218.html It was unfortunately posted as attachment and without the S.O.B. line which is probably why you mistook it for mine. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Sun Feb 5 03:38:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Feb 2006 06:38:40 -0500 Subject: [openib-general] relocation error / link time reference error In-Reply-To: References: <43E36DB0.7070801@dbresearch.net> <1138979240.15119.17035.camel@hal.voltaire.com> <43E3A7E3.4020307@dbresearch.net> <1138994689.26011.82.camel@hal.voltaire.com> Message-ID: <1139139404.4450.361.camel@hal.voltaire.com> Hi Sasha, On Sat, 2006-02-04 at 15:19, Sasha Khapyorsky wrote: > Hi, > > -----Original Message----- > From: openib-general-bounces at openib.org on behalf of Hal Rosenstock > > > osm/include/opensm/osm_svn_revision.h ? I made a change for this > this AM > > to not have it in the tree as it is autogenerated during the OpenSM > > make. > > Yes, this works with incremental build because osm_svn_revision.h is > already > in dependency list (in one of .deps/ files), but fails with fresh > build. > > Sean, please try attached patch - hope it solves the problem. > > Sasha. Thanks. Applied. 
-- Hal From halr at voltaire.com Sun Feb 5 03:41:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Feb 2006 06:41:55 -0500 Subject: [openib-general] Re: [PATCH] Opensm - build fails due to osm_svn_revision.h missing In-Reply-To: <5zslqy142a.fsf@mtl066.yok.mtl.com> References: <5zslqy142a.fsf@mtl066.yok.mtl.com> Message-ID: <1139139469.4450.364.camel@hal.voltaire.com> On Sun, 2006-02-05 at 05:40, Yael Kalka wrote: > Hi Hal, > > Currently the build of opensm fails, since osm_svn_revision.h is > missing (and created only later). > The following patch issued by Michael Tsirkin fixes it. > > Thanks, > Yael > > Signed-off-by: Yael Kalka I think this was originally Sasha's patch. Thanks. Applied. -- Hal > > Index: opensm/Makefile.am > =================================================================== > --- opensm/Makefile.am (revision 5303) > +++ opensm/Makefile.am (working copy) > @@ -10,6 +10,7 @@ DBGFLAGS = -g > endif > > if OSMV_OPENIB > +BUILT_SOURCES = $(srcdir)/../include/opensm/osm_svn_revision.h > .PHONY: always > $(srcdir)/../include/opensm/osm_svn_revision.h: always > echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ > 
From ogerlitz at voltaire.com Sun Feb 5 05:15:03 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 05 Feb 2006 15:15:03 +0200 Subject: [openib-general] Re: [PATCH] enable the fmr pool user to set the page size In-Reply-To: References: Message-ID: <43E5FA57.3010608@voltaire.com> Roland Dreier wrote: > I applied this and queued it for 2.6.17, but I'm a little worried > about how much testing you have done with this. It looks safe enough > but your patch didn't even compile. I see, well at the time of sending i was sure it compiles... Anyway, i've used svn r5301 to test with iser the patch you applied to fmr_pool & mthca - it works just fine. Or. 
From ogerlitz at voltaire.com Sun Feb 5 05:17:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 05 Feb 2006 15:17:51 +0200 Subject: [openib-general] [PATCH 0/4] SA path record caching In-Reply-To: <43E2444D.7090300@ichips.intel.com> References: <43E1A9C2.5060609@voltaire.com> <43E2444D.7090300@ichips.intel.com> Message-ID: <43E5FAFF.9040205@voltaire.com> Sean Hefty wrote: >> I guess feedback from MPI people telling whether they have plans to >> use path query would help us to see where we actually stand. > > Hal and I will be meeting with them in Sonoma to make sure that we have > their requirements down. But they have asked that each node have all > path records as input into their routing algorithms. It would be very much appreciated if you can post the inputs from these meetings, specifically as this point (SA cache/replica design) affects directly the vendors SMs/SAs. Or. From rdreier at cisco.com Sun Feb 5 09:31:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 05 Feb 2006 09:31:32 -0800 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3014E62CA@mtlexch01.mtl.com> (Dotan Barak's message of "Sun, 5 Feb 2006 09:47:54 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3014E62CA@mtlexch01.mtl.com> Message-ID: Dotan> Thanks, this patch works for me too. I have one comments: Dotan> in the IB spec it is written: "Enable or disable Send Queue Dotan> Drained, Asynchronous Affiliated Event Notification. This Dotan> modifier is only applicable when the next QP state chosen Dotan> is SQD." Dotan> I think that the following transitions should support this Dotan> event: SQD->RTS, SQD->SQD. I'm not sure whether that interpretation is correct or not. In any case, it seems that Mellanox HCAs only support enabling the event on the RTS->SQD transition. - R. 
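[Editor's sketch, not part of the original thread.] The restriction Roland describes matches the condition in the mthca_modify_qp() hunk quoted earlier in this thread: the event-request bit is set only for the RTS->SQD transition. A minimal stand-alone sketch of that condition follows. The enum, the mask constant and the function name sqd_event_bit() are stand-ins with assumed values, not the real ib_verbs.h definitions.

```c
#include <stdint.h>

/* Stand-in definitions for illustration only. The real QP states and
 * attribute-mask bits live in ib_verbs.h; the values here are assumptions. */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_SQD };
#define QP_EN_SQD_ASYNC_NOTIFY (1u << 2)

/* Hypothetical helper mirroring the quoted hunk: request the
 * "send queue drained" event only on RTS->SQD, and only if the caller
 * both included the attribute in the mask and set it to non-zero. */
uint32_t sqd_event_bit(enum qp_state cur, enum qp_state next,
                       uint32_t attr_mask, int en_sqd_async_notify)
{
    if (cur == QPS_RTS && next == QPS_SQD &&
        (attr_mask & QP_EN_SQD_ASYNC_NOTIFY) && en_sqd_async_notify)
        return UINT32_C(1) << 31;   /* unsigned shift avoids overflow of int */
    return 0;
}
```

Under this reading, the SQD->SQD and SQD->RTS transitions Dotan asks about always yield 0, so the request bit is never passed to the firmware command for them.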
From yael at mellanox.co.il Mon Feb 6 00:53:48 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 10:53:48 +0200 Subject: [openib-general] [PATCH] Opensm - cl_event_wheel casting Message-ID: <5zpsm027hf.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds the casting done in a clearer way - to avoid compilation errors in windows. Also - added a clear message if the timeout was trimmed (due to the casting). Thanks, Yael Signed-off-by: Yael Kalka Index: complib/cl_event_wheel.c =================================================================== --- complib/cl_event_wheel.c (revision 5307) +++ complib/cl_event_wheel.c (working copy) @@ -426,8 +426,18 @@ cl_event_wheel_reg( * cl_timer_stop(&p_event_wheel->timer); */ + /* The timeout for the cl_timer_start should be given as uint32_t. + if there is an overflow - warn about it. */ + if ( timeout > (uint32_t)timeout ) + { + osm_log (p_event_wheel->p_log, OSM_LOG_INFO, + "cl_event_wheel_reg: " + "timeout requested is too large. Using timeout: %u \n", + (uint32_t)timeout ); + } + /* start the timer to the timeout [msec] */ - cl_status = cl_timer_start(&p_event_wheel->timer, timeout); + cl_status = cl_timer_start(&p_event_wheel->timer, (uint32_t)timeout); if (cl_status != CL_SUCCESS) { From yael at mellanox.co.il Mon Feb 6 01:03:10 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 11:03:10 +0200 Subject: [openib-general] [PATCH] Opensm - osm_db_file.c - windows fixes Message-ID: <5zoe1k271t.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds some changes in osm_db_file.c to match the windows stack. 
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_db_files.c =================================================================== --- opensm/osm_db_files.c (revision 5307) +++ opensm/osm_db_files.c (working copy) @@ -172,6 +172,12 @@ osm_db_init( if ( p_db_imp->db_dir_name == NULL ) p_db_imp->db_dir_name = OSM_DEFAULT_CACHE_DIR; + /* create the directory if it doesn't exist */ + /* There is difference between creating in windows and in linux */ +#ifdef __WIN__ + /* Check if the directory exists. If not - create it. */ + CreateDirectory(p_db_imp->db_dir_name, NULL); +#else /* __WIN__ */ /* make sure the directory exists */ if (lstat(p_db_imp->db_dir_name, &dstat)) { @@ -185,6 +191,7 @@ osm_db_init( return 1; } } +#endif p_db->p_log = p_log; p_db->p_db_imp = (void*)p_db_imp; @@ -466,6 +473,14 @@ osm_db_store( fclose(p_file); /* move the domain file */ + status = remove(p_domain_imp->file_name); + if (status) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_db_store: ERR 6909: " + " Fail to remove file:%s (err:%u)\n", + p_domain_imp->file_name, status); + } status = rename(p_tmp_file_name, p_domain_imp->file_name); if (status) { From yael at mellanox.co.il Mon Feb 6 01:27:59 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 11:27:59 +0200 Subject: [openib-general] [PATCH] Opensm - fix casting for windows Message-ID: <5zmzh425wg.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds some missing casts and fixes object types to fix compilation errors in the windows stack, aadds some changes in osm_db_file.c to match the windows stack. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_db_pack.c =================================================================== --- opensm/osm_db_pack.c (revision 5307) +++ opensm/osm_db_pack.c (working copy) @@ -80,13 +80,13 @@ __osm_unpack_lids( if (! 
p_num) return 1; tmp = strtoul(p_num, NULL, 0); CL_ASSERT( tmp < 0x10000 ); - *p_min_lid = tmp; + *p_min_lid = (uint16_t)tmp; p_num = strtok_r(NULL, " \t", &p_next); if (! p_num) return 1; tmp = strtoul(p_num, NULL, 0); CL_ASSERT( tmp < 0x10000 ); - *p_max_lid = tmp; + *p_max_lid = (uint16_t)tmp; return 0; } Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 5307) +++ opensm/osm_lid_mgr.c (working copy) @@ -742,7 +742,7 @@ void { cl_ptr_vector_t *p_discovered_vec = &p_mgr->p_subn->port_lid_tbl; uint16_t lid, min_lid, max_lid; - uint16_t max_tbl_lid = cl_ptr_vector_get_size( p_discovered_vec ); + uint16_t max_tbl_lid = (uint16_t)(cl_ptr_vector_get_size( p_discovered_vec )); osm_port_get_lid_range_ho(p_port, &min_lid, &max_lid); for (lid = min_lid; lid <= max_lid; lid++) Index: opensm/osm_pkey.c =================================================================== --- opensm/osm_pkey.c (revision 5307) +++ opensm/osm_pkey.c (working copy) @@ -76,7 +76,7 @@ void osm_pkey_tbl_destroy( IN osm_pkey_tbl_t *p_pkey_tbl) { uint16_t num_blocks, i; - num_blocks = cl_ptr_vector_get_size( &p_pkey_tbl->blocks ); + num_blocks = (uint16_t)(cl_ptr_vector_get_size( &p_pkey_tbl->blocks )); for (i = 0; i < num_blocks; i++) cl_free(cl_ptr_vector_get( &p_pkey_tbl->blocks, i )); cl_ptr_vector_destroy( &p_pkey_tbl->blocks ); @@ -202,7 +202,8 @@ osm_physp_share_pkey( IN const osm_physp_t* const p_physp_1, IN const osm_physp_t* const p_physp_2 ) { - ib_net16_t *pkey1, *pkey2, pkey1_base, pkey2_base; + ib_net16_t *pkey1, *pkey2; + uint64_t pkey1_base, pkey2_base; const osm_pkey_tbl_t *pkey_tbl1, *pkey_tbl2; cl_map_iterator_t map_iter1, map_iter2; Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 5307) +++ opensm/osm_pkey_mgr.c (working copy) @@ -234,7 +234,7 @@ osm_pkey_mgr_process( osm_node_t *p_node; osm_node_t *p_next_node; - 
uint32_t port_num; + uint8_t port_num; osm_physp_t *p_physp; osm_signal_t result = OSM_SIGNAL_DONE; Index: opensm/osm_trap_rcv.c =================================================================== --- opensm/osm_trap_rcv.c (revision 5307) +++ opensm/osm_trap_rcv.c (working copy) @@ -135,7 +135,7 @@ osm_trap_rcv_aging_tracker_callback( /* We got an exit flag - do nothing */ return 0; - lid = (uint16_t)cl_ntoh16(( key & 0x0000FFFF00000000ULL) >> 32); + lid = cl_ntoh16((uint16_t)(( key & 0x0000FFFF00000000ULL) >> 32)); port_num = (uint8_t)(( key & 0x00FF000000000000ULL) >> 48); p_physp = __get_physp_by_lid_and_num( p_rcv, lid, port_num ); Index: opensm/osm_ucast_updn.c =================================================================== --- opensm/osm_ucast_updn.c (revision 5307) +++ opensm/osm_ucast_updn.c (working copy) @@ -620,7 +620,8 @@ updn_subn_rank( { /* Init local vars */ osm_port_t *p_root_port=NULL; - uint8_t tbl_size,rank=base_rank; + uint16_t tbl_size; + uint8_t rank=base_rank; osm_physp_t *p_physp, *p_remote_physp,*p_physp_temp; cl_list_t *p_currList,*p_nextList; cl_status_t did_cause_update; @@ -639,7 +640,7 @@ updn_subn_rank( p_currList = p_nextList; /* Check valid subnet & guid */ - tbl_size = cl_qmap_count(&(osm.subn.port_guid_tbl)); + tbl_size = (uint16_t)(cl_qmap_count(&(osm.subn.port_guid_tbl))); if (tbl_size == 0) { osm_log(&(osm.log), OSM_LOG_ERROR, @@ -1078,7 +1079,7 @@ osm_updn_find_root_nodes_by_min_hop( OUT uint8_t hop_val; uint16_t numHopBarsOverThd1 = 0; uint16_t numHopBarsOverThd2 = 0; - float thd1,thd2; + double thd1,thd2; p_sw = p_next_sw; /* Roll to the next switch */ From yael at mellanox.co.il Mon Feb 6 02:00:19 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 12:00:19 +0200 Subject: [openib-general] [PATCH] Opensm - osm_sa_path_record.c - variable declaration Message-ID: <5zlkwo24ek.fsf@mtl066.yok.mtl.com> Hi Hal, There was an issue discussed a while ago regarding declaration of several variables inside the function, 
in the code handling path records for multicast. Declarations in the middle of a function don't compile on Windows, and in the past you said that your preferred approach is to wrap the code handling the multicast path records in braces. This patch adds these braces. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 5307) +++ opensm/osm_sa_path_record.c (working copy) @@ -1753,7 +1753,7 @@ osm_pr_rcv_process( osm_log(p_rcv->p_log, OSM_LOG_DEBUG, "osm_pr_rcv_process: " "Multicast destination requested\n" ); - + { osm_mgrp_t *p_mgrp = NULL; ib_api_status_t status; osm_pr_item_t* p_pr_item; @@ -1815,6 +1815,7 @@ osm_pr_rcv_process( "MC group attributes don't match PathRecord request\n" ); } } + } /* Now, (finally) respond to the PathRecord request */ From yael at mellanox.co.il Mon Feb 6 02:05:45 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 12:05:45 +0200 Subject: [openib-general] [PATCH] Opensm - osm_ucast_mgr.c - use dynamic alloc Message-ID: <5zk6c8245i.fsf@mtl066.yok.mtl.com> Hi Hal, The original allocation (arrays sized at run time by lids_per_port) doesn't compile on Windows. The attached patch replaces it with dynamic allocation. 
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_ucast_mgr.c =================================================================== --- opensm/osm_ucast_mgr.c (revision 5307) +++ opensm/osm_ucast_mgr.c (working copy) @@ -633,13 +633,31 @@ __osm_ucast_mgr_process_port( in providing better routing in LMC > 0 situations */ uint16_t lids_per_port = 1 << p_mgr->p_subn->opt.lmc; - uint64_t remote_sys_guids[lids_per_port]; - uint64_t remote_node_guids[lids_per_port]; + uint64_t* remote_sys_guids = NULL; + uint64_t* remote_node_guids = NULL; uint16_t num_used_sys = 0; uint16_t num_used_nodes = 0; OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_process_port ); + remote_sys_guids = cl_zalloc( sizeof(uint64_t) * lids_per_port ); + if( remote_sys_guids == NULL ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_ucast_mgr_process_port: ERR 3A09: " + "Cannot allocate array. Memory insufficient.\n"); + goto Exit; + } + + remote_node_guids = cl_zalloc( sizeof(uint64_t) * lids_per_port ); + if( remote_node_guids == NULL ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_ucast_mgr_process_port: ERR 3A0A: " + "Cannot allocate array. Memory insufficient.\n"); + goto Exit; + } + osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); /* If the lids are zero - then there was some problem with the initialization. @@ -767,6 +785,8 @@ __osm_ucast_mgr_process_port( osm_switch_set_path( p_sw, lid_ho, port, is_ignored_by_port_prof); } Exit: + if (remote_sys_guids) cl_free(remote_sys_guids); + if (remote_node_guids) cl_free(remote_node_guids); OSM_LOG_EXIT( p_mgr->p_log ); } From yael at mellanox.co.il Mon Feb 6 02:15:03 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 12:15:03 +0200 Subject: [openib-general] [PATCH] Opensm - osm_reg_sig_handler in Windows Message-ID: <5zirrs23q0.fsf@mtl066.yok.mtl.com> Hi Hal, The osm_reg_sig_handler function is not supported in Windows. The following patch adds the function only if non-Windows stack. 
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_opensm.h =================================================================== --- include/opensm/osm_opensm.h (revision 5307) +++ include/opensm/osm_opensm.h (working copy) @@ -394,6 +394,7 @@ extern volatile int osm_exit_flag; * Set to one to cause all threads to leave *********/ +#ifndef __WIN__ /****f* OpenSM: OpenSM/osm_reg_sig_handler * NAME * osm_reg_sig_handler @@ -417,6 +418,7 @@ IN osm_opensm_t* const p_osm); * * SEE ALSO *********/ +#endif /* __WIN__ */ END_C_DECLS Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 5307) +++ opensm/osm_opensm.c (working copy) @@ -151,6 +151,7 @@ osm_opensm_create_mcgroups( /********************************************************************** * SHUT DOWN IS CONTROLLED BY A GLOBAL EXIT FLAG **********************************************************************/ +#ifndef __WIN__ static osm_opensm_t *__p_osm_to_signal; void @@ -191,6 +192,7 @@ osm_reg_sig_handler( return; } +#endif /* __WIN__ */ /********************************************************************** **********************************************************************/ From mst at mellanox.co.il Mon Feb 6 04:25:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Feb 2006 14:25:21 +0200 Subject: [openib-general] Re: [PATCH] Opensm - osm_reg_sig_handler in Windows In-Reply-To: <5zirrs23q0.fsf@mtl066.yok.mtl.com> References: <5zirrs23q0.fsf@mtl066.yok.mtl.com> Message-ID: <20060206122521.GB31609@mellanox.co.il> Quoting r. Yael Kalka : > Subject: [PATCH] Opensm - osm_reg_sig_handler in Windows > > > Hi Hal, > > The osm_reg_sig_handler function is not supported in Windows. > The following patch adds the function only if non-Windows stack. 
> > Thanks, > Yael > > Signed-off-by: Yael Kalka As was pointed out several times, we dont really need a signal handler in linux, either, since driver detects the application exiting automatically. Can we kill it completely please? Work around for broken drivers that cant detect application exiting belongs in the vendor layer, not in opensm proper. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Mon Feb 6 04:41:06 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Mon, 6 Feb 2006 14:41:06 +0200 Subject: [openib-general] RE: [PATCH] Opensm - osm_reg_sig_handler in Windows Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD83@mtlexch01.mtl.com> Michael, The signal handling for catching ^C (SIGINT) was deleted before. There are other signalling caught by OpenSM, for example SIGHUP, that enables triggering the OpenSM to do another heavy sweep. We do not want to remove this. Yael -----Original Message----- From: Michael S. Tsirkin Sent: Monday, February 06, 2006 2:25 PM To: Yael Kalka Cc: halr at voltaire.com; openib-general at openib.org Subject: Re: [PATCH] Opensm - osm_reg_sig_handler in Windows Quoting r. Yael Kalka : > Subject: [PATCH] Opensm - osm_reg_sig_handler in Windows > > > Hi Hal, > > The osm_reg_sig_handler function is not supported in Windows. > The following patch adds the function only if non-Windows stack. > > Thanks, > Yael > > Signed-off-by: Yael Kalka As was pointed out several times, we dont really need a signal handler in linux, either, since driver detects the application exiting automatically. Can we kill it completely please? Work around for broken drivers that cant detect application exiting belongs in the vendor layer, not in opensm proper. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Mon Feb 6 04:39:24 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 14:39:24 +0200 Subject: [openib-general] [PATCH] Opensm - clean osm_vendor_mlx_sa.c code Message-ID: <5zhd7c1x1f.fsf@mtl066.yok.mtl.com> Hi Hal, Currently in osm_vendor_mlx_sa.c the sent context is saved arbitrarily as nodeInfo_context. This results in need for strange castings from long to pointer and vice-versa. The following patch adds another possible context - arbitrary context, which will be used in this case. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_mlx_sa.c =================================================================== --- libvendor/osm_vendor_mlx_sa.c (revision 5307) +++ libvendor/osm_vendor_mlx_sa.c (working copy) @@ -96,9 +96,9 @@ __osmv_sa_mad_rcv_cb( goto Exit; } - /* obtain the sent context since we store it during send in the ni_ctx */ + /* obtain the sent context */ p_query_req_copy = - (osmv_query_req_t *)CAST_P2LONG(p_req_madw->context.ni_context.node_guid); + (osmv_query_req_t *)(p_req_madw->context.arb_context.context1); /* provide the context of the original request in the result */ query_res.query_context = p_query_req_copy->query_context; @@ -207,7 +207,7 @@ __osmv_sa_mad_err_cb( /* Obtain the sent context etc */ p_query_req_copy = - (osmv_query_req_t *)CAST_P2LONG(p_madw->context.ni_context.node_guid); + (osmv_query_req_t *)(p_madw->context.arb_context.context1); /* provide the context of the original request in the result */ query_res.query_context = p_query_req_copy->query_context; @@ -561,10 +561,17 @@ __osmv_send_sa_req( /* Provide the address to send to */ + /* Patch to handle IBAL - host order , where it should take destination lid in network order */ +#ifdef OSM_VENDOR_INTF_AL + p_madw->mad_addr.dest_lid = p_bind->sm_lid; +#else p_madw->mad_addr.dest_lid = cl_hton16(p_bind->sm_lid); +#endif p_madw->mad_addr.addr_type.smi.source_lid = 
cl_hton16(p_bind->lid); p_madw->mad_addr.addr_type.gsi.remote_qp = CL_HTON32(1); + p_madw->mad_addr.addr_type.gsi.remote_qkey = IB_QP1_WELL_KNOWN_Q_KEY; + p_madw->mad_addr.addr_type.gsi.pkey = IB_DEFAULT_PKEY; p_madw->resp_expected = TRUE; p_madw->fail_msg = CL_DISP_MSGID_NONE; @@ -574,12 +581,11 @@ __osmv_send_sa_req( Since we can not rely on the client to keep it arroud until the response - we duplicate it and will later dispose it (in CB). To store on the MADW we cast it into what opensm has: - p_madw->context.ni_context.node_guid + p_madw->context.arb_context.context1 */ p_query_req_copy = cl_malloc(sizeof(*p_query_req_copy)); *p_query_req_copy = *p_query_req; - p_madw->context.ni_context.node_guid = - (ib_net64_t)CAST_P2LONG(p_query_req_copy); + p_madw->context.arb_context.context1 = p_query_req_copy; /* we can support async as well as sync calls */ sync = ((p_query_req->flags & OSM_SA_FLAGS_SYNC) == OSM_SA_FLAGS_SYNC); Index: include/opensm/osm_madw.h =================================================================== --- include/opensm/osm_madw.h (revision 5307) +++ include/opensm/osm_madw.h (working copy) @@ -315,6 +315,22 @@ typedef struct _osm_vla_context boolean_t set_method; } osm_vla_context_t; /*********/ +/****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t +* NAME +* osm_sa_context_t +* +* DESCRIPTION +* Context needed by arbitrary recipient. +* +* SYNOPSIS +*/ +typedef struct _osm_arbitrary_context +{ + void* context1; + void* context2; +} osm_arbitrary_context_t; +/*********/ + /****s* OpenSM: MAD Wrapper/osm_madw_context_t * NAME * osm_madw_context_t @@ -335,6 +351,7 @@ typedef union _osm_madw_context osm_smi_context_t smi_context; osm_slvl_context_t slvl_context; osm_pkey_context_t pkey_context; + osm_arbitrary_context_t arb_context; } osm_madw_context_t; /*********/ @@ -880,6 +897,34 @@ osm_madw_get_vla_context_ptr( } /* * PARAMETERS +* p_madw +* [in] Pointer to an osm_madw_t object. 
+* +* RETURN VALUES +* Pointer to the start of the context structure. +* +* NOTES +* +* SEE ALSO +*********/ + +/****f* OpenSM: MAD Wrapper/osm_madw_get_arbitrary_context_ptr +* NAME +* osm_madw_get_arbitrary_context_ptr +* +* DESCRIPTION +* Gets a pointer to the arbitrary context in this MAD. +* +* SYNOPSIS +*/ +static inline osm_arbitrary_context_t* +osm_madw_get_arbitrary_context_ptr( + IN const osm_madw_t* const p_madw ) +{ + return( (osm_arbitrary_context_t*)&p_madw->context ); +} +/* +* PARAMETERS * p_madw * [in] Pointer to an osm_madw_t object. * From yael at mellanox.co.il Mon Feb 6 04:43:31 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Feb 2006 14:43:31 +0200 Subject: [openib-general] [PATCH] Opensm - add syslog prints in windows Message-ID: <5zfymw1wuk.fsf@mtl066.yok.mtl.com> Hi Hal, Currently SYSLOG prints are not executed under Windows. The following patch adds these printings to the Windows stack as well. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_log.c =================================================================== --- opensm/osm_log.c (revision 5307) +++ opensm/osm_log.c (working copy) @@ -105,6 +105,8 @@ osm_log( usecs = time_usecs % 1000000; localtime_r(&tim, &result); +#endif /* WIN32 */ + /* If this is a call to syslog - always print it */ if ( verbosity & OSM_LOG_SYS ) { @@ -122,16 +124,21 @@ osm_log( } /* send it also to the log file */ +#ifdef WIN32 + GetLocalTime(&st); + fprintf( p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s", + st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, + pid, buffer); +#else fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s\n", (result.tm_mon < 12 ? 
month_str[result.tm_mon] : "???"), result.tm_mday, result.tm_hour, result.tm_min, result.tm_sec, usecs, pid, buffer); fflush( p_log->out_port ); - +#endif } -#endif /* WIN32 */ /* SYS messages go to the log anyways */ if (p_log->level & verbosity) From mst at mellanox.co.il Mon Feb 6 04:56:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Feb 2006 14:56:13 +0200 Subject: [openib-general] Re: [PATCH] Opensm - osm_reg_sig_handler in Windows In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD83@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD83@mtlexch01.mtl.com> Message-ID: <20060206125613.GC31609@mellanox.co.il> Quoting r. Yael Kalka : > The signal handling for catching ^C (SIGINT) was deleted before. Oops, should have looked at the context. You are right, sorry. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From takshak at gs-lab.com Mon Feb 6 05:00:59 2006 From: takshak at gs-lab.com (Takshak C.) Date: Mon, 06 Feb 2006 18:30:59 +0530 Subject: [openib-general] Get Table Records for SA Attribute ID ? Message-ID: <43E7488B.1050808@gs-lab.com> Hi, I'm trying to get the table records for an SA attribute ID in the following way, but I'm not getting a single record. Could anyone comment on the problem? 1. I have created the saMadFormat structure described in the specification, as below: struct saMadFormat { uint8_t base_version ; uint8_t mgmt_class ; uint8_t class_version ; uint8_t sa_method ; uint16_t status ; uint16_t not_used ; uint64_t tid ; uint16_t attr_id ; uint16_t resv ; uint32_t attr_mod ; uint64_t sa_key; uint64_t sm_key ; uint32_t seg_num ; uint32_t payload_len ; uint8_t frag_flag ; uint8_t edit_mod ; uint16_t window ; uint32_t endRID ; uint64_t comp_mask ; uint8_t adminData[192] ; }; 2. Then I have done all the basic operations (umad_open, umad_register for the IB_SA_CLASS, umad_open_port, etc.) successfully. 3. 
struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); memset(saQuery, 0, sizeof(*saQuery)); saQuery->base_version = 1; saQuery->mgmt_class = IB_SA_CLASS ; saQuery->class_version = 1 ; saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; saQuery->attr_mod = 0 ; saQuery->tid = htonll(drmad_tid++); saQuery->endRID = 0 ; umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); umad_set_grh(umad, 0); umad_set_pkey(umad, 0xFFFF); 4. length = IB_MAD_SIZE; if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) IBPANIC("send failed"); if (umad_recv(portid, umad, &length, -1) != mad_agent) IBPANIC("recv error: %s", drmad_status_str(saQuery)); if (!dump_char) { xdump(stdout, 0, saQuery->adminData, 192); return 0; } I'm expecting that I will get the resulting data in saQuery->adminData. Is this correct? If not, how should I retrieve the table records? Any ideas? Thanks - Takshak From tziporet at mellanox.co.il Mon Feb 6 05:27:06 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 6 Feb 2006 15:27:06 +0200 Subject: [openib-general] Re: does the mthca driver support RTS->SQD event request? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B10C@mtlexch01.mtl.com> Roland> I'm not sure whether that interpretation is correct or not. In any Roland> case, it seems that Mellanox HCAs only support enabling the event on Roland> the RTS->SQD transition. This is correct Tziporet From halr at voltaire.com Mon Feb 6 06:02:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 6 Feb 2006 16:02:41 +0200 Subject: [openib-general] Get Table Records for SA Attribute ID ? Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> Hi, There are a couple of issues with the below. 1. The SA MAD structure is missing the RMPP header. Once I saw that, I didn't check for further issues with the format. 2. I will assume your register call sets RMPP. 3. The SA class version is 2. What SM are you using? 
If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). -- Hal 
From sashak at voltaire.com Mon Feb 6 06:28:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 6 Feb 2006 16:28:06 +0200 Subject: [openib-general] Re: [PATCH] Opensm - osm_reg_sig_handler in Windows In-Reply-To: <20060206122521.GB31609@mellanox.co.il> References: <5zirrs23q0.fsf@mtl066.yok.mtl.com> <20060206122521.GB31609@mellanox.co.il> Message-ID: <20060206142806.GA29321@sashak.voltaire.com> On 14:25 Mon 06 Feb , Michael S. Tsirkin wrote: > > As was pointed out several times, we dont really need a signal > handler in linux, The signals are used. For instance SIGHUP will initiate re-sweep (I use it frequently), clean exit is done with SIGINT and SIGTERM. If signals are not supported in windows this can be simply masked in less aggressive way, Something like: #define signal(a,b) , or #define cl_reg_sig_hdl(a,b) (or something better) in windows specific common header file. Sasha. 
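A minimal sketch of the masking idea in the message above, assuming a build where `__WIN__` is defined for the Windows stack (as in the patch under discussion). The names `reg_sig_handlers`, `on_sighup` and `resweep_requested` are hypothetical and are not OpenSM symbols; this is a variant of the macro suggestion, not the code that was committed. The call site stays identical on both platforms, and on Windows the registration compiles to nothing.

```c
#include <signal.h>

/* Hypothetical flag: a subnet manager's main loop could poll this and
 * start a heavy sweep when it becomes nonzero. */
static volatile sig_atomic_t resweep_requested = 0;

#ifdef __WIN__
/* Assumption: no SIGHUP on this platform, so registration is a no-op.
 * Callers do not need their own #ifdef. */
#define reg_sig_handlers() ((void)0)
#else
/* The handler only sets a flag, which is safe inside a signal handler. */
static void on_sighup(int signo)
{
    (void)signo;
    resweep_requested = 1;
}

/* Register the SIGHUP handler on platforms that have it. */
static void reg_sig_handlers(void)
{
    signal(SIGHUP, on_sighup);
}
#endif
```

Compared with wrapping each function definition in `#ifndef __WIN__`, this keeps the conditional in one place, which is roughly what a Windows-specific common header would do.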
From ogerlitz at voltaire.com Mon Feb 6 07:53:05 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 6 Feb 2006 17:53:05 +0200 (IST) Subject: [openib-general] iser: 4 change sets to the code Message-ID: ------------------------------------------------------------------------ r5314 | ogerlitz | 2006-02-06 17:47:06 +0200 (Mon, 06 Feb 2006) | 5 lines connection establishment error flow bugfixes: don't call rdma_destroy_id from the cma callback flow and don't call sock_release when the socket might be touched later. Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5312 | ogerlitz | 2006-02-06 17:39:26 +0200 (Mon, 06 Feb 2006) | 4 lines moved the code of conn init/connect/release from iser_conn.c to iser_verbs.c, cleanups Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5311 | ogerlitz | 2006-02-06 17:34:23 +0200 (Mon, 06 Feb 2006) | 4 lines deallocate adaptor (shared IB resources among iser connections) when there's no demand Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5309 | ogerlitz | 2006-02-06 17:26:57 +0200 (Mon, 06 Feb 2006) | 4 lines various cleanups, cosmetic changes for coding conventions Signed-off-by: Or Gerlitz From Arkady.Kanevsky at netapp.com Mon Feb 6 08:07:49 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 11:07:49 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal Message-ID: Here are the changes to the existing requirements chapters for RDMA Write with Immediate Data. Feedback please. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, February 03, 2006 7:30 PM > To: Davis, Arlin R > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediate dataproposal > > Davis, Arlin R wrote: > > "Applications need an optimized mechanism to notify the > receiving end > > that RDMA write data has completed beyond the two operation method > > currently used (RDMA write followed by message send). This new RDMA > > write feature will support 4-bytes of inline data that will be sent > > Is there any reason to restrict the size of the immediate > data? Could you define the API such that the size is > variable? I.e. the provider can simply give the immediate > data size, with 0 indicating that it is not supported. > > > It should avoid > > any latency penalties normally associated with a two > operation method. > > I would state this as a requirement. A write followed by a > send should be pushed to the application, since they may be > able to provide additional optimizations (such as combining > operations) beyond what a provider could. > > > The initiating side must expose a 4-byte immediate data > parameter for > > the application to set the inline data. The receiving side must > > provide a mechanism to accept the 4-byte immediate data. On the > > receiving side, the write with immediate completion notification is > > indicated through a receive completion. It is the responsibility of > > the provider to identify to the application 4-byte > immediate data from > > a normal 4-byte send message. The inline byte ordering is > application specific." > > Requirements look good to me. 
> > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- A non-text attachment was scrubbed... Name: transport_req_020606.pdf Type: application/octet-stream Size: 26718 bytes Desc: transport_req_020606.pdf URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: immed_req_inside_020606.pdf Type: application/octet-stream Size: 25512 bytes Desc: immed_req_inside_020606.pdf URL: From mst at mellanox.co.il Mon Feb 6 08:39:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Feb 2006 18:39:19 +0200 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <1139015153.475.20.camel@brick.internal.keyresearch.com> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> Message-ID: <20060206163919.GI31609@mellanox.co.il> Quoting r. Ralph Campbell : > Subject: [PATCH] change Mellanox SDP workaround to a moduleparameter > > This patch changes the hardwired MTU limit of 1024 in SDP > into a module parameter so it can be disabled for HCAs > without the RC performance problem. > > Signed-off-by: Ralph Campbell Hmm. Do we want this as a compile-time option too, for people that might compile SDP in kernel? +module_param(sdp_path_mtu_max, int, 0); Why 0? Lets make this editable from sysfs? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 6 08:55:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 6 Feb 2006 18:55:02 +0200 Subject: [openib-general] Re: [PATCH] use set_current_state() in SDP In-Reply-To: <1139017877.475.30.camel@brick.internal.keyresearch.com> References: <1139017877.475.30.camel@brick.internal.keyresearch.com> Message-ID: <20060206165502.GJ31609@mellanox.co.il> Quoting r. Ralph Campbell : > > On Fri, 2006-02-03 at 17:06 -0800, Roland Dreier wrote: > > I think both of these places can use __set_current_state(). > > > > - R. > > Good point. Here is the updated patch. > > Signed-off-by: Ralph Campbell Hmm. We would be using wait_event_exclusive except there is no such beast. I wonder whether we can at least switch to prepare_to_wait_exclusive/finish_wait? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 6 09:27:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Feb 2006 19:27:39 +0200 Subject: [openib-general] mthca: gid index bug? Message-ID: <20060206172739.GR31609@mellanox.co.il> Roland, in mthca_qp.c we have path->mgid_index = ah->grh.sgid_index; Shouldn't the port number be taken into account, like it is with mthca_av, where we have av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 6 09:41:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Feb 2006 09:41:13 -0800 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <20060206163919.GI31609@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 6 Feb 2006 18:39:19 +0200") References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060206163919.GI31609@mellanox.co.il> Message-ID: Michael> Do we want this as a compile-time option too, for people Michael> that might compile SDP in kernel? Module options can be set on the kernel command line. - R. 
From arlin.r.davis at intel.com Mon Feb 6 10:25:09 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 6 Feb 2006 10:25:09 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal Message-ID: <59278FC0C48A994BABABD069571E45680DDD5979@orsmsx401.amr.corp.intel.com> Arkady, Your requirements are slightly different than the proposed set of requirements. "iii) DAPL Provider does not provide any identification that that the Receive operation matches remote RDMA Write with Immediate data if it completes as Receive DTO. - It is up to an ULP to separate Receive completion of remote Send from remote RDMA Write with Immediate Data." Tell me how this is possible? How can the application distinguish between a 4 byte message and a 4 byte immediate data message? We would have to add a new requirement... "If the provider supports immediate data in the payload the ULP cannot send a message equal to the immediate data size". -arlin >-----Original Message----- >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >Sent: Monday, February 06, 2006 8:08 AM >To: Sean Hefty; Davis, Arlin R >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal > >Here are the changes to the existing requirements chapters >for RDMA Write with Immediate Data. > >Feedback please. >Arkady > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance Inc. phone: 781-768-5395 >1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 >Waltham, MA 02451 central phone: 781-768-5300 > > >> -----Original Message----- >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] >> Sent: Friday, February 03, 2006 7:30 PM >> To: Davis, Arlin R >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >> Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 >> immediate dataproposal >> >> Davis, Arlin R wrote: >> > "Applications need an optimized mechanism to notify the >> receiving end >> > that RDMA write data has completed beyond the two operation method >> > currently used (RDMA write followed by message send). This new RDMA >> > write feature will support 4-bytes of inline data that will be sent >> >> Is there any reason to restrict the size of the immediate >> data? Could you define the API such that the size is >> variable? I.e. the provider can simply give the immediate >> data size, with 0 indicating that it is not supported. >> >> > It should avoid >> > any latency penalties normally associated with a two >> operation method. >> >> I would state this as a requirement. A write followed by a >> send should be pushed to the application, since they may be >> able to provide additional optimizations (such as combining >> operations) beyond what a provider could. >> >> > The initiating side must expose a 4-byte immediate data >> parameter for >> > the application to set the inline data. The receiving side must >> > provide a mechanism to accept the 4-byte immediate data. On the >> > receiving side, the write with immediate completion notification is >> > indicated through a receive completion. It is the responsibility of >> > the provider to identify to the application 4-byte >> immediate data from >> > a normal 4-byte send message. The inline byte ordering is >> application specific." >> >> Requirements look good to me. 
>> >> - Sean >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> From rdreier at cisco.com Mon Feb 6 09:51:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Feb 2006 09:51:26 -0800 Subject: [openib-general] Re: mthca: gid index bug? In-Reply-To: <20060206172739.GR31609@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 6 Feb 2006 19:27:39 +0200") References: <20060206172739.GR31609@mellanox.co.il> Message-ID: > Roland, in mthca_qp.c we have > > path->mgid_index = ah->grh.sgid_index; > > Shouldnt the port number be taken into account, like it > is with mthca_av, where we have > av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + I really don't know. The PRM just says "index to port GID table". Can you check it out at Mellanox and (even better) generate a patch if it's wrong? Thanks, Roland From rdreier at cisco.com Mon Feb 6 10:03:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Feb 2006 10:03:43 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060205084633.GM5673@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 5 Feb 2006 10:46:33 +0200") References: <20060205084633.GM5673@mellanox.co.il> Message-ID: Michael> Maybe the description is not clear enough. There are two Michael> issues here with two separate fixes. OK, got it now. Michael> 1. IPOIB_MCAST_STARTED - solves the first issue Michael> 2. Checking priv->broadcast in ipoib_mcast_send here: + Michael> if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || Michael> !priv->broadcast) { - solves the second issue Makes sense. 
Related to this, the way priv->broadcast is initialized in ipoib_mcast_join_task() looks somewhat unsafe, since there's no lock and conceivably a send-only join could complete before priv->broadcast is fully set up. What do you think? Michael> They just got rolled into one patch because they touch Michael> the same code lines. Michael> Do you want me to split them up? No, I can handle it. Thanks... - R. From caitlinb at broadcom.com Mon Feb 6 10:47:51 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 6 Feb 2006 10:47:51 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D817@NT-SJCA-0751.brcm.ad.broadcom.com> dat-discussions at yahoogroups.com wrote: > Arkady, > > Your requirements are slightly different then the proposed set of > requirements. > > "iii) DAPL Provider does not provide any identification that > that the Receive operation matches remote RDMA Write with > Immediate data if it completes as Receive DTO. > > - It is up to an ULP to separate Receive completion of remote > Send from remote RDMA Write with Immediate Data." > > Tell me how this is possible? How can the application > distinguish between a 4 byte message and a 4 byte immediate > data message? We would have to add a new requirement... "If > the provider supports immediate data in the payload the ULP > cannot send a message equal to the immediate > data size". > The data sink knows whether the 4 bytes were sent as a message or as an immediate because it is clear in the ULP context. Possible methods: The expected completion is an immediate. All 4 byte messages are immediates. All 4 byte messages where the ms-byte is X are immediate. If it's Tuesday it's an immediate. If it's a prime number it's an immediate ... But there is no clue from the transport layer.
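The ULP-context methods listed above can be sketched roughly as follows. This is only an illustration in C under stated assumptions, not DAT or DAPL API: the names imm_policy, ulp_ctx, and is_immediate are hypothetical, and the sketch assumes the ULP places the most significant byte first in the 4-byte payload (the proposal leaves byte ordering to the application).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ULP-side policies; none of these names come from DAT/DAPL. */
enum imm_policy {
    POLICY_EXPECTED,    /* ULP protocol state says the next completion is an immediate */
    POLICY_ALL_4BYTE,   /* every 4-byte message on this connection is an immediate */
    POLICY_MS_BYTE_TAG  /* 4-byte messages whose most significant byte equals a tag */
};

/* Per-connection state kept by the ULP (assumed layout, for illustration). */
struct ulp_ctx {
    enum imm_policy policy;
    bool expecting_imm; /* consulted only by POLICY_EXPECTED */
    uint8_t tag;        /* consulted only by POLICY_MS_BYTE_TAG */
};

#define IMM_SIZE 4u     /* immediate data size discussed in the proposal */

/*
 * Decide, from ULP context alone, whether a receive completion of `len`
 * bytes in `buf` carries immediate data. The transport gives no hint,
 * so anything that is not exactly IMM_SIZE bytes is a normal Send.
 */
bool is_immediate(const struct ulp_ctx *ctx, const uint8_t *buf, size_t len)
{
    if (len != IMM_SIZE)
        return false;

    switch (ctx->policy) {
    case POLICY_EXPECTED:
        return ctx->expecting_imm;
    case POLICY_ALL_4BYTE:
        return true;
    case POLICY_MS_BYTE_TAG:
        /* Assumes the ULP sends the most significant byte first. */
        return buf[0] == ctx->tag;
    }
    return false;
}
```

A receiver would call is_immediate() on each Receive completion before interpreting the buffer. Which policy is appropriate is a ULP protocol decision, for example a dedicated connection versus a tagged payload.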
From jlentini at netapp.com Mon Feb 6 11:02:08 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 6 Feb 2006 14:02:08 -0500 (EST) Subject: [openib-general] [ANNOUNCE] iSER BOF at OpenIB workshop Message-ID: For those of you at the OpenIB workshop, there will be an iSER BOF this evening from 18:30-19:30 in the Palm Tree Salon. This BOF will cover iSER in general and discuss the development of an open source Linux iSER target on the OpenIB stack in particular. From Arkady.Kanevsky at netapp.com Mon Feb 6 11:05:02 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 14:05:02 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal Message-ID: Arlin, On Friday we agreed that the receiver cannot distinguish between 4 bytes of Send and 4 bytes of Immediate data if RDMA Write with Immed is implemented as 2 operations: RDMA Write followed by Send. The ULP Receiver "expects" Immediate data, which is why it posts a Recv. Depending on Transport capability it MAY complete as Recv or as Recv_RDMA_Write_with_Immed_in_event. Neither Provider nor Consumer can distinguish between the cases unless there is additional info. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > Sent: Monday, February 06, 2006 1:25 PM > To: Kanevsky, Arkady; Sean Hefty > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediate dataproposal > > > Arkady, > > Your requirements are slightly different then the proposed > set of requirements. > > "iii) DAPL Provider does not provide any identification that > that the Receive operation matches remote RDMA Write with > Immediate data if it completes as Receive DTO.
> > - It is up to an ULP to separate Receive completion of remote > Send from remote RDMA Write with Immediate Data." > > Tell me how this is possible? How can the application > distinguish between a 4 byte message and a 4 byte immediate > data message? We would have to add a new requirement... "If > the provider supports immediate data in the payload the ULP > cannot send a message equal to the immediate > data size". > > -arlin > > >-----Original Message----- > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] > >Sent: Monday, February 06, 2006 8:08 AM > >To: Sean Hefty; Davis, Arlin R > >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > 2.0 immediate > dataproposal > > > >Here are the changes to the existing requirements chapters for RDMA > >Write with Immediate Data. > > > >Feedback please. > >Arkady > > > >Arkady Kanevsky email: arkady at netapp.com > >Network Appliance Inc. phone: 781-768-5395 > >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > >Waltham, MA 02451 central phone: 781-768-5300 > > > > > >> -----Original Message----- > >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] > >> Sent: Friday, February 03, 2006 7:30 PM > >> To: Davis, Arlin R > >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >> Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 > >> immediate dataproposal > >> > >> Davis, Arlin R wrote: > >> > "Applications need an optimized mechanism to notify the > >> receiving end > >> > that RDMA write data has completed beyond the two > operation method > >> > currently used (RDMA write followed by message send). > This new RDMA > >> > write feature will support 4-bytes of inline data that > will be sent > >> > >> Is there any reason to restrict the size of the immediate data? > >> Could you define the API such that the size is variable? I.e. 
the > >> provider can simply give the immediate data size, with 0 > indicating > >> that it is not supported. > >> > >> > It should avoid > >> > any latency penalties normally associated with a two > >> operation method. > >> > >> I would state this as a requirement. A write followed by a send > >> should be pushed to the application, since they may be able to > >> provide additional optimizations (such as combining > >> operations) beyond what a provider could. > >> > >> > The initiating side must expose a 4-byte immediate data > >> parameter for > >> > the application to set the inline data. The receiving side must > >> > provide a mechanism to accept the 4-byte immediate data. On the > >> > receiving side, the write with immediate completion > notification is > >> > indicated through a receive completion. It is the > responsibility of > >> > the provider to identify to the application 4-byte > >> immediate data from > >> > a normal 4-byte send message. The inline byte ordering is > >> application specific." > >> > >> Requirements look good to me. > >> > >> - Sean > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > >> > From Arkady.Kanevsky at netapp.com Mon Feb 6 11:08:02 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 14:08:02 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: Arlin, It is too strong to state that Consumer should never send a message equal in size to the size of immediate data. Consumer knows from the context which one it is. it may be based on dedicated connection, or based on ULP protocol ordering. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Kanevsky, Arkady > Sent: Monday, February 06, 2006 2:05 PM > To: Davis, Arlin R; Sean Hefty > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > Arlin, > On Friday we agreed that receiver can not distinguish between > 4 byte of Send or 4 bytes of Immediate data if RDMA Write > with Immed is implemented as 2 operations: > RDMA Write followed by Send. > > ULP Reciever "expects" Immediate data that is why it posts > Recv. Depending on Transport capability it MAY complete as > Recv or as Recv_RDMA_Write_with_Immed_in_event. > > Neither Provider not Consumer can distinguish between the > cases unless there is additional info. > > Arkady > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance Inc. phone: 781-768-5395 > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > Waltham, MA 02451 central phone: 781-768-5300 > > > > -----Original Message----- > > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > > Sent: Monday, February 06, 2006 1:25 PM > > To: Kanevsky, Arkady; Sean Hefty > > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > > immediate dataproposal > > > > > > Arkady, > > > > Your requirements are slightly different then the proposed set of > > requirements. > > > > "iii) DAPL Provider does not provide any identification > that that the > > Receive operation matches remote RDMA Write with Immediate > data if it > > completes as Receive DTO. > > > > - It is up to an ULP to separate Receive completion of remote > > Send from remote RDMA Write with Immediate Data." > > > > Tell me how this is possible? How can the application distinguish > > between a 4 byte message and a 4 byte immediate data > message? We would > > have to add a new requirement... 
"If the provider supports > immediate > > data in the payload the ULP cannot send a message equal to the > > immediate > > data size". > > > > -arlin > > > > >-----Original Message----- > > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] > > >Sent: Monday, February 06, 2006 8:08 AM > > >To: Sean Hefty; Davis, Arlin R > > >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > > >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > > 2.0 immediate > > dataproposal > > > > > >Here are the changes to the existing requirements chapters > for RDMA > > >Write with Immediate Data. > > > > > >Feedback please. > > >Arkady > > > > > >Arkady Kanevsky email: arkady at netapp.com > > >Network Appliance Inc. phone: 781-768-5395 > > >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > > >Waltham, MA 02451 central phone: 781-768-5300 > > > > > > > > >> -----Original Message----- > > >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > >> Sent: Friday, February 03, 2006 7:30 PM > > >> To: Davis, Arlin R > > >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > > >> Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 > > >> immediate dataproposal > > >> > > >> Davis, Arlin R wrote: > > >> > "Applications need an optimized mechanism to notify the > > >> receiving end > > >> > that RDMA write data has completed beyond the two > > operation method > > >> > currently used (RDMA write followed by message send). > > This new RDMA > > >> > write feature will support 4-bytes of inline data that > > will be sent > > >> > > >> Is there any reason to restrict the size of the immediate data? > > >> Could you define the API such that the size is variable? > I.e. the > > >> provider can simply give the immediate data size, with 0 > > indicating > > >> that it is not supported. > > >> > > >> > It should avoid > > >> > any latency penalties normally associated with a two > > >> operation method. 
> > >> > > >> I would state this as a requirement. A write followed by a send > > >> should be pushed to the application, since they may be able to > > >> provide additional optimizations (such as combining > > >> operations) beyond what a provider could. > > >> > > >> > The initiating side must expose a 4-byte immediate data > > >> parameter for > > >> > the application to set the inline data. The receiving > side must > > >> > provide a mechanism to accept the 4-byte immediate > data. On the > > >> > receiving side, the write with immediate completion > > notification is > > >> > indicated through a receive completion. It is the > > responsibility of > > >> > the provider to identify to the application 4-byte > > >> immediate data from > > >> > a normal 4-byte send message. The inline byte ordering is > > >> application specific." > > >> > > >> Requirements look good to me. > > >> > > >> - Sean > > >> _______________________________________________ > > >> openib-general mailing list > > >> openib-general at openib.org > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> To unsubscribe, please visit > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From roy.k.larsen at intel.com Mon Feb 6 11:10:15 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Mon, 6 Feb 2006 11:10:15 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E08612BE1@orsmsx408> If it is up to the ULP to separate out "normal" receive data from that associated with a write immediate, how is this different from the ULP doing a write followed by a send? 
If there is no difference, then what we're really talking about is a convenience to the initiating ULP. Perhaps what would be best is to construct an API that allows the ULP to perform standard write/send operations into one call which the underlying provider could optimize into one transaction with the associated interconnect interface. Better yet, a general request combining interface would have even more value, but calling this write/send "immediate" data is a stretch, if not downright silly. Some transports have true immediate data that provides unique value. There is nothing unique in a write/send sequence - ULPs do it all the time... Roy -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Caitlin Bestler Sent: Monday, February 06, 2006 10:48 AM To: dat-discussions at yahoogroups.com; Kanevsky, Arkady; Sean Hefty Cc: openib-general at openib.org Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal dat-discussions at yahoogroups.com wrote: > Arkady, > > Your requirements are slightly different then the proposed set of > requirements. > > "iii) DAPL Provider does not provide any identification that > that the Receive operation matches remote RDMA Write with > Immediate data if it completes as Receive DTO. > > - It is up to an ULP to separate Receive completion of remote > Send from remote RDMA Write with Immediate Data." > > Tell me how this is possible? How can the application > distinguish between a 4 byte message and a 4 byte immediate > data message? We would have to add a new requirement... "If > the provider supports immediate data in the payload the ULP > cannot send a message equal to the immediate > data size". > The data sink knows whether the 4 bytes was sent as a message or as an immediate because it is clear in the ULP context. Possible methods: The expected completion is an immediate. All 4 byte messages are immediates. 
All 4 byte messages where the ms-byte is X are immediate. If its Tuesday its an immediate. If it's a prime number its an immediate ... But there is no clue from the transport layer. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Arkady.Kanevsky at netapp.com Mon Feb 6 11:21:46 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 14:21:46 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate dataproposal Message-ID: I should stress that the only "additional" requirement I had added beyond the DAT meeting agreement is Provider attribute for the size of Immediate Data. It will be set to 4 bytes in DAT now . But this may not be cast in stone permanently. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Kanevsky, Arkady > Sent: Monday, February 06, 2006 11:08 AM > To: Sean Hefty; Davis, Arlin R > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediate dataproposal > > Here are the changes to the existing requirements chapters > for RDMA Write with Immediate Data. > > Feedback please. > Arkady > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance Inc. phone: 781-768-5395 > 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 > Waltham, MA 02451 central phone: 781-768-5300 > > > > -----Original Message----- > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Friday, February 03, 2006 7:30 PM > > To: Davis, Arlin R > > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > > Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 > > immediate dataproposal > > > > Davis, Arlin R wrote: > > > "Applications need an optimized mechanism to notify the > > receiving end > > > that RDMA write data has completed beyond the two > operation method > > > currently used (RDMA write followed by message send). > This new RDMA > > > write feature will support 4-bytes of inline data that > will be sent > > > > Is there any reason to restrict the size of the immediate > data? Could > > you define the API such that the size is variable? I.e. > the provider > > can simply give the immediate data size, with 0 indicating > that it is > > not supported. > > > > > It should avoid > > > any latency penalties normally associated with a two > > operation method. > > > > I would state this as a requirement. A write followed by a send > > should be pushed to the application, since they may be able > to provide > > additional optimizations (such as combining > > operations) beyond what a provider could. > > > > > The initiating side must expose a 4-byte immediate data > > parameter for > > > the application to set the inline data. The receiving side must > > > provide a mechanism to accept the 4-byte immediate data. On the > > > receiving side, the write with immediate completion > notification is > > > indicated through a receive completion. It is the > responsibility of > > > the provider to identify to the application 4-byte > > immediate data from > > > a normal 4-byte send message. The inline byte ordering is > > application specific." > > > > Requirements look good to me. 
> > > > - Sean > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > Yahoo! Groups Links > > <*> To visit your group on the web, go to: > http://groups.yahoo.com/group/dat-discussions/ > > <*> To unsubscribe from this group, send an email to: > dat-discussions-unsubscribe at yahoogroups.com > > <*> Your use of Yahoo! Groups is subject to: > http://docs.yahoo.com/info/terms/ > > From caitlinb at broadcom.com Mon Feb 6 11:23:55 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 6 Feb 2006 11:23:55 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D827@NT-SJCA-0751.brcm.ad.broadcom.com> Larsen, Roy K wrote: > If it is up to the ULP to separate out "normal" receive data > from that associated with a write immediate, how is this > different from the ULP doing a write followed by a send? If > there is no difference, then what we're really talking about > is a convenience to the initiating ULP. > > Perhaps what would be best is to construct an API that allows > the ULP to perform standard write/send operations into one > call which the underlying provider could optimize into one > transaction with the associated interconnect interface. > Better yet, a general request combining interface would have > even more value, but calling this write/send "immediate" data > is a stretch, if not downright silly. Some transports have > true immediate data that provides unique value. There is > nothing unique in a write/send sequence - ULPs do it all the time... > The data provided is to identify the completion notification that completes the RDMA Write to the data sink. So, yes, it is not really an "immediate" value. 
We could consider a better name for it, much as we renamed QP to something better. But the meaning is "the tag value associated with a specific RDMA Message". It is delivered in order, after that RDMA Message has fully completed. What varies by transport is *how* it is delivered. We are considering identifying it as a single work request so that transport-specific contraction to a single wire message is enabled. But we don't want to change any of the semantics vs. the application doing Write then Send. The new call enables an optimization, but should not change the overall semantics. That could extend as far as having the receiver recognize the alternate reception. From Arkady.Kanevsky at netapp.com Mon Feb 6 11:30:20 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 14:30:20 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: Roy, Can you explain, please? For IB the operation will be layered properly on the Transport primitive. And on the Recv side it will indicate in the completion event DTO that it matches RDMA Write with Immediate and that the Immediate Data is in the event. For iWARP, I expect that initially it will be layered on RDMA Write followed by Send. The Provider can do the post more efficiently than the Consumer and guarantee atomicity. On the Recv side the Consumer will get a Recv DTO completion in the event and the Immediate Data inline as specified by the Provider Attribute. From the performance point of view, Consumers who program to IB only will have no performance degradation at all. But this API also allows Consumers to write a ULP that is transport independent with minimal penalty: one binary comparison and an extra 4 bytes in the recv buffer. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] > Sent: Monday, February 06, 2006 2:10 PM > To: Caitlin Bestler; dat-discussions at yahoogroups.com; > Kanevsky, Arkady; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > If it is up to the ULP to separate out "normal" receive data > from that associated with a write immediate, how is this > different from the ULP doing a write followed by a send? If > there is no difference, then what we're really talking about > is a convenience to the initiating ULP. > > Perhaps what would be best is to construct an API that allows > the ULP to perform standard write/send operations into one > call which the underlying provider could optimize into one > transaction with the associated interconnect interface. > Better yet, a general request combining interface would have > even more value, but calling this write/send "immediate" data > is a stretch, if not downright silly. Some transports have > true immediate data that provides unique value. There is > nothing unique in a write/send sequence - ULPs do it all the time... > > Roy > > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Caitlin Bestler > Sent: Monday, February 06, 2006 10:48 AM > To: dat-discussions at yahoogroups.com; Kanevsky, Arkady; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > dat-discussions at yahoogroups.com wrote: > > Arkady, > > > > Your requirements are slightly different then the proposed set of > > requirements. > > > > "iii) DAPL Provider does not provide any identification > that that the > > Receive operation matches remote RDMA Write with Immediate > data if it > > completes as Receive DTO. 
> > > > - It is up to an ULP to separate Receive completion of remote > > Send from remote RDMA Write with Immediate Data." > > > > Tell me how this is possible? How can the application distinguish > > between a 4 byte message and a 4 byte immediate data > message? We would > > have to add a new requirement... "If the provider supports > immediate > > data in the payload the ULP cannot send a message equal to the > > immediate data size". > > > > The data sink knows whether the 4 bytes was sent as a message > or as an immediate because it is clear in the ULP context. > Possible methods: > The expected completion is an immediate. > All 4 byte messages are immediates. > All 4 byte messages where the ms-byte is X are immediate. > If its Tuesday its an immediate. > If it's a prime number its an immediate > ... > > But there is no clue from the transport layer. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Mon Feb 6 11:37:51 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 6 Feb 2006 14:37:51 -0500 (EST) Subject: [openib-general] [ANNOUNCE] DAPL BOF Message-ID: There will be a DAPL BOF this evening from 19:30-20:00 in the Palm Tree Salon. I plan to setup a conference call for those of you who would like to participate remotely. Here is the info: phone: 888-867-8686 id: 1068642 From chas at cmf.nrl.navy.mil Mon Feb 6 12:48:35 2006 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Mon, 6 Feb 2006 15:48:35 -0500 Subject: [openib-general] questions about gen2 srp driver Message-ID: <200602062048.k16KmZfH025495@cmf.nrl.navy.mil> i have been looking at the srp driver in the gen2 trunk (and the version that is in the latest 2.6.15 kernels). 
i have a couple questions about its behavior and i am hoping someone can answer them. it seems to take scsi_host->host_lock with a spin_lock_irq() inside a couple of work queues. i believe work queues run at process context and not interrupt context. therefore, one should probably use spin_lock_irqsave()? secondly, there seems to be only one pair of lkeys/rkeys for a given srp "virtual" host. in srp_map_data() i see the rkey is assigned to the buffer: buf->key = cpu_to_be32(target->srp_host->mr->rkey); but the virtual host adapter template says: .can_queue = SRP_SQ_SIZE, .cmd_per_lun = SRP_SQ_SIZE, if there is only a single set of rdma keys how can the driver support more than one outstanding command (particularly on a target with multiple luns)? i didn't think the srp_post_send() was synchronous with respect to the completion of the current rdma request? From arlin.r.davis at intel.com Mon Feb 6 13:16:45 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 6 Feb 2006 13:16:45 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <59278FC0C48A994BABABD069571E45680DDD5E6C@orsmsx401.amr.corp.intel.com> I just want to get consensus on the requirements before we get too far. One thing I forgot is that with InfiniBand, the receive with immediate provides the size of the rdma write that just completed. I think we should include this in the requirements since there is ULP value here. -arlin >-----Original Message----- >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >Sent: Monday, February 06, 2006 11:08 AM >To: Kanevsky, Arkady; Davis, Arlin R; Sean Hefty >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal > >Arlin, >It is too strong to state that Consumer should never send a message >equal in size to the size of immediate data. >Consumer knows from the context which one it is.
>it may be based on dedicated connection, or based on ULP protocol >ordering. >Arkady > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance Inc. phone: 781-768-5395 >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >Waltham, MA 02451 central phone: 781-768-5300 > > >> -----Original Message----- >> From: Kanevsky, Arkady >> Sent: Monday, February 06, 2006 2:05 PM >> To: Davis, Arlin R; Sean Hefty >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >> Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 >> immediatedataproposal >> >> Arlin, >> On Friday we agreed that receiver can not distinguish between >> 4 byte of Send or 4 bytes of Immediate data if RDMA Write >> with Immed is implemented as 2 operations: >> RDMA Write followed by Send. >> >> ULP Reciever "expects" Immediate data that is why it posts >> Recv. Depending on Transport capability it MAY complete as >> Recv or as Recv_RDMA_Write_with_Immed_in_event. >> >> Neither Provider not Consumer can distinguish between the >> cases unless there is additional info. >> >> Arkady >> >> Arkady Kanevsky email: arkady at netapp.com >> Network Appliance Inc. phone: 781-768-5395 >> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >> Waltham, MA 02451 central phone: 781-768-5300 >> >> >> > -----Original Message----- >> > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] >> > Sent: Monday, February 06, 2006 1:25 PM >> > To: Kanevsky, Arkady; Sean Hefty >> > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >> > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 >> > immediate dataproposal >> > >> > >> > Arkady, >> > >> > Your requirements are slightly different then the proposed set of >> > requirements. >> > >> > "iii) DAPL Provider does not provide any identification >> that that the >> > Receive operation matches remote RDMA Write with Immediate >> data if it >> > completes as Receive DTO. 
>> > >> > - It is up to an ULP to separate Receive completion of remote >> > Send from remote RDMA Write with Immediate Data." >> > >> > Tell me how this is possible? How can the application distinguish >> > between a 4 byte message and a 4 byte immediate data >> message? We would >> > have to add a new requirement... "If the provider supports >> immediate >> > data in the payload the ULP cannot send a message equal to the >> > immediate >> > data size". >> > >> > -arlin >> > >> > >-----Original Message----- >> > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >> > >Sent: Monday, February 06, 2006 8:08 AM >> > >To: Sean Hefty; Davis, Arlin R >> > >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >> > >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT >> > 2.0 immediate >> > dataproposal >> > > >> > >Here are the changes to the existing requirements chapters >> for RDMA >> > >Write with Immediate Data. >> > > >> > >Feedback please. >> > >Arkady >> > > >> > >Arkady Kanevsky email: arkady at netapp.com >> > >Network Appliance Inc. phone: 781-768-5395 >> > >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >> > >Waltham, MA 02451 central phone: 781-768-5300 >> > > >> > > >> > >> -----Original Message----- >> > >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] >> > >> Sent: Friday, February 03, 2006 7:30 PM >> > >> To: Davis, Arlin R >> > >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org >> > >> Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 >> > >> immediate dataproposal >> > >> >> > >> Davis, Arlin R wrote: >> > >> > "Applications need an optimized mechanism to notify the >> > >> receiving end >> > >> > that RDMA write data has completed beyond the two >> > operation method >> > >> > currently used (RDMA write followed by message send). 
>> > This new RDMA >> > >> > write feature will support 4-bytes of inline data that >> > will be sent >> > >> >> > >> Is there any reason to restrict the size of the immediate data? >> > >> Could you define the API such that the size is variable? >> I.e. the >> > >> provider can simply give the immediate data size, with 0 >> > indicating >> > >> that it is not supported. >> > >> >> > >> > It should avoid >> > >> > any latency penalties normally associated with a two >> > >> operation method. >> > >> >> > >> I would state this as a requirement. A write followed by a send >> > >> should be pushed to the application, since they may be able to >> > >> provide additional optimizations (such as combining >> > >> operations) beyond what a provider could. >> > >> >> > >> > The initiating side must expose a 4-byte immediate data >> > >> parameter for >> > >> > the application to set the inline data. The receiving >> side must >> > >> > provide a mechanism to accept the 4-byte immediate >> data. On the >> > >> > receiving side, the write with immediate completion >> > notification is >> > >> > indicated through a receive completion. It is the >> > responsibility of >> > >> > the provider to identify to the application 4-byte >> > >> immediate data from >> > >> > a normal 4-byte send message. The inline byte ordering is >> > >> application specific." >> > >> >> > >> Requirements look good to me. 
>> > >> >> > >> - Sean >> > >> _______________________________________________ >> > >> openib-general mailing list >> > >> openib-general at openib.org >> > >> http://openib.org/mailman/listinfo/openib-general >> > >> >> > >> To unsubscribe, please visit >> > >> http://openib.org/mailman/listinfo/openib-general >> > >> >> > >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> From caitlinb at broadcom.com Mon Feb 6 13:20:05 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 6 Feb 2006 13:20:05 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D86C@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > I just want to get consensus on the requirements before we get too > far. One thing I forgot is that with Infiniband, the receive with > immediate provides the size of the rdma write that just > completed. I think we should include this in the requirements > since there is ULP value here. > > -arlin > That *could* be done, it would be an eight byte message over iWARP, 4 for length and 4 for the message tag. From roy.k.larsen at intel.com Mon Feb 6 13:25:12 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Mon, 6 Feb 2006 13:25:12 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E08612F6E@orsmsx408> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >Roy, >Can you explain, please? > >For IB the operation will be layered properly on Transport primitive. >And on Recv side it will indicate in completion event DTO >that it matches RDMA Write with Immediate and that Immediate Data >is in event. 
> >For iWARP I expect initially, it will be layered on RDMA Write >followed by Send. The Provider can do post more efficiently >than Consumer and guarantee atomicity. >On Recv side Consumer will get Recv DTO completion in event >and Immediate Data inline as specified by Provider Attribute. > >From the performance point of view Consumers who program to IB >only will have no performance degradation at all. But this API also >allows Consumers to write ULP to be transport independent >with minimal penalty: one binary comparison and extra 4 bytes in recv >buffer. If the application could be written transport independently, I would have no objection at all. Instead, it must be written in a transport-adaptive way, and to be able to adapt to all possible implementations, the application could not send arbitrary "immediate"-sized data as messages, because there is no way to distinguish between them on the receiving side. That is HUGE! It is my experience that send/receive is generally used for small messages, and to take away particular message sizes, or to depend on the application to "adapt" to whatever the immediate size is for a particular transport, if even needed, is a very weak facility to offer. It also affects interface resource allocation. Send queue sizes will have to adapt to possibly twice their size. It just dawned on me that the immediate data must be in registered memory to be sent in a message. This means the API must be amended to pass an LMR or, even worse, the provider would have to register memory in the speed path or create and manipulate its own queue of "immediate" data buffers/LMRs. Of course, LMRs are not needed, and are pure overhead, for transports that provide true immediate data. Oh, and another thing. InfiniBand indicates the size of the RDMA write in the receive completion. That is something that will have to be addressed in a "transport independent" way or dropped as part of the service.
The bottom line here is that it is NOT transport independent. Now, the atomicity argument between write and send has some credibility. If an application chooses to "adapt" to an explicit write/send semantic for write completion notification in environments that can't provide it natively, this could be addressed by a generalized combined request API that can guarantee thread-based atomicity to the send queue. This seems much more straightforward to me since, in essence, to adapt to non-native immediate data services, they would have to allocate resources and behave in virtually the same way as if they did write/send explicitly. It is obvious that the proposed service is not one of immediate data in the sense defined by InfiniBand. Since true immediate data is a transport specific speed path service, it needs to be implemented as a transport specific extension. To allow an application to initiate multiple request sequences that must be queued sequentially to explicitly create a write completion notification or any other order-based sequence, a generalized combined request API should be defined. > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance Inc. phone: 781-768-5395 >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >Waltham, MA 02451 central phone: 781-768-5300 > > From jlentini at netapp.com Mon Feb 6 13:25:40 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 6 Feb 2006 16:25:40 -0500 (EST) Subject: [openib-general] Re: [ANNOUNCE] DAPL BOF In-Reply-To: References: Message-ID: On Mon, 6 Feb 2006, James Lentini wrote: > > There will be a DAPL BOF this evening from 19:30-20:00 in the Palm > Tree Salon. Correct. We will be in the Lavender room. > I plan to setup a conference call for those of you who would like to > participate remotely. 
Here is the info: > > phone: 888-867-8686 > id: 1068642 From Arkady.Kanevsky at netapp.com Mon Feb 6 14:11:57 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 17:11:57 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: Good point. I will add this to the requirements and augment the necessary transfered_length text. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > Sent: Monday, February 06, 2006 4:17 PM > To: Kanevsky, Arkady; Sean Hefty > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > I just want to get consensus on the requirements before we > get too far. > One thing I forgot is that with Infiniband, the receive with > immediate provides the size of the rdma write that just > completed. I think we should include this in the requirements > since there is ULP value here. > > -arlin > > >-----Original Message----- > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] > >Sent: Monday, February 06, 2006 11:08 AM > >To: Kanevsky, Arkady; Davis, Arlin R; Sean Hefty > >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > > >Arlin, > >It is too strong to state that Consumer should never send a message > >equal in size to the size of immediate data. > >Consumer knows from the context which one it is. > >it may be based on dedicated connection, or based on ULP protocol > >ordering. > >Arkady > > > >Arkady Kanevsky email: arkady at netapp.com > >Network Appliance Inc. phone: 781-768-5395 > >1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 > >Waltham, MA 02451 central phone: 781-768-5300 > > > > > >> -----Original Message----- > >> From: Kanevsky, Arkady > >> Sent: Monday, February 06, 2006 2:05 PM > >> To: Davis, Arlin R; Sean Hefty > >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >> Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > >> immediatedataproposal > >> > >> Arlin, > >> On Friday we agreed that receiver can not distinguish between > >> 4 byte of Send or 4 bytes of Immediate data if RDMA Write > with Immed > >> is implemented as 2 operations: > >> RDMA Write followed by Send. > >> > >> ULP Reciever "expects" Immediate data that is why it posts Recv. > >> Depending on Transport capability it MAY complete as Recv or as > >> Recv_RDMA_Write_with_Immed_in_event. > >> > >> Neither Provider not Consumer can distinguish between the cases > >> unless there is additional info. > >> > >> Arkady > >> > >> Arkady Kanevsky email: arkady at netapp.com > >> Network Appliance Inc. phone: 781-768-5395 > >> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > >> Waltham, MA 02451 central phone: 781-768-5300 > >> > >> > >> > -----Original Message----- > >> > From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] > >> > Sent: Monday, February 06, 2006 1:25 PM > >> > To: Kanevsky, Arkady; Sean Hefty > >> > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >> > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > >> > immediate dataproposal > >> > > >> > > >> > Arkady, > >> > > >> > Your requirements are slightly different then the > proposed set of > >> > requirements. > >> > > >> > "iii) DAPL Provider does not provide any identification > >> that that the > >> > Receive operation matches remote RDMA Write with Immediate > >> data if it > >> > completes as Receive DTO. > >> > > >> > - It is up to an ULP to separate Receive completion of remote > >> > Send from remote RDMA Write with Immediate Data." 
> >> > > >> > Tell me how this is possible? How can the application > distinguish > >> > between a 4 byte message and a 4 byte immediate data > >> message? We would > >> > have to add a new requirement... "If the provider supports > >> immediate > >> > data in the payload the ULP cannot send a message equal to the > >> > immediate data size". > >> > > >> > -arlin > >> > > >> > >-----Original Message----- > >> > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] > >> > >Sent: Monday, February 06, 2006 8:08 AM > >> > >To: Sean Hefty; Davis, Arlin R > >> > >Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >> > >Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > >> > 2.0 immediate > >> > dataproposal > >> > > > >> > >Here are the changes to the existing requirements chapters > >> for RDMA > >> > >Write with Immediate Data. > >> > > > >> > >Feedback please. > >> > >Arkady > >> > > > >> > >Arkady Kanevsky email: arkady at netapp.com > >> > >Network Appliance Inc. phone: 781-768-5395 > >> > >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > >> > >Waltham, MA 02451 central phone: 781-768-5300 > >> > > > >> > > > >> > >> -----Original Message----- > >> > >> From: Sean Hefty [mailto:mshefty at ichips.intel.com] > >> > >> Sent: Friday, February 03, 2006 7:30 PM > >> > >> To: Davis, Arlin R > >> > >> Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > >> > >> Subject: Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 > >> > >> immediate dataproposal > >> > >> > >> > >> Davis, Arlin R wrote: > >> > >> > "Applications need an optimized mechanism to notify the > >> > >> receiving end > >> > >> > that RDMA write data has completed beyond the two > >> > operation method > >> > >> > currently used (RDMA write followed by message send). 
> >> > This new RDMA > >> > >> > write feature will support 4-bytes of inline data that > >> > will be sent > >> > >> > >> > >> Is there any reason to restrict the size of the > immediate data? > >> > >> Could you define the API such that the size is variable? > >> I.e. the > >> > >> provider can simply give the immediate data size, with 0 > >> > indicating > >> > >> that it is not supported. > >> > >> > >> > >> > It should avoid > >> > >> > any latency penalties normally associated with a two > >> > >> operation method. > >> > >> > >> > >> I would state this as a requirement. A write > followed by a send > >> > >> should be pushed to the application, since they may > be able to > >> > >> provide additional optimizations (such as combining > >> > >> operations) beyond what a provider could. > >> > >> > >> > >> > The initiating side must expose a 4-byte immediate data > >> > >> parameter for > >> > >> > the application to set the inline data. The receiving > >> side must > >> > >> > provide a mechanism to accept the 4-byte immediate > >> data. On the > >> > >> > receiving side, the write with immediate completion > >> > notification is > >> > >> > indicated through a receive completion. It is the > >> > responsibility of > >> > >> > the provider to identify to the application 4-byte > >> > >> immediate data from > >> > >> > a normal 4-byte send message. The inline byte ordering is > >> > >> application specific." > >> > >> > >> > >> Requirements look good to me. 
> >> > >> > >> > >> - Sean > >> > >> > >> > From Arkady.Kanevsky at netapp.com Mon Feb 6 14:26:52 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Mon, 6 Feb 2006 17:26:52 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: Roy, comments inline. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] > Sent: Monday, February 06, 2006 4:25 PM > To: Kanevsky, Arkady; Caitlin Bestler; > dat-discussions at yahoogroups.com; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 > immediatedataproposal > > > > >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] > >Roy, > >Can you explain, please? > > > >For IB the operation will be layered properly on Transport primitive. > >And on Recv side it will indicate in completion event DTO that it > >matches RDMA Write with Immediate and that Immediate Data is > in event. > > > >For iWARP I expect initially, it will be layered on RDMA > Write followed > >by Send. The Provider can do post more efficiently than Consumer and > >guarantee atomicity.
> >On Recv side Consumer will get Recv DTO completion in event and > >Immediate Data inline as specified by Provider Attribute. > > > >From the performance point of view Consumers who program to IB only > >will have no performance degradation at all. But this API > also allows > >Consumers to write ULP to be transport independent with minimal > >penalty: one binary comparison and extra 4 bytes in recv buffer. > > If the application could be written transport independently, > I would have no objection at all. Instead, it must be > written in a transport-adaptive way and to be able to adapt > to all possible implementations, the application could not > send arbitrary "immediate"-sized data as messages because > there is no way to distinguish between them on the receiving > side. That is HUGE! It is my experience that send/receive > is generally used for small messages and to take away > particular message sizes or to depend on the so the > application can "adapt" to whatever the immediate size is for > a particular transport, if even needed, is a very weak > facility to offer. But the remote side does post a Recv. Since it anticipates that this Recv will be matched against the RDMA Write with Immediate, it posts a recv buffer that fits. Yes, there is an issue for a Transport-independent ULP in that it does need a buffer. For IB it is possible to post a 0-size buffer. But if this is the case, the Recv-end Consumer DOES know that it will be matched against an RDMA Write, so the ULP DOES know what it will be matched against. So in the worst case the Consumer does have to pay the price of creating an LMR to handle a 4-byte buffer to match the RDMA Write Immediate data. > > It also affects interface resource allocation. Send queue > sizes will have to adapt to possibly twice there size. > That is correct. We argued about it at the meeting. One alternative is to have EP and EVD attr.
But this will not be efficient, since it would double the queue size where a smaller increment is possible, given the depth of the outstanding RDMA Write pipeline. > It just dawned on me that the immediate data must be in > registered memory to be sent in a message. This means the > API must be amended to pass an LMR or, even worse, the > provider would have to register memory in the speed path or > create and manipulate its own queue of "immediate" > data buffers/LMRs. Of course, LMRs are not needed and an > overhead for transports that provide true immediate data. No registration on the speed path. It is the Consumer's responsibility to provide a Recv buffer of the right size. Yes, for an IB-only ULP this can be avoided. A ULP can be written to the proposed API to take full advantage of IB performance, but that code will not be transport independent. But this API also allows writing transport-independent code, albeit with a certain price attached. > > Oh, and another thing. InfiniBand indicates the size of the > RDMA write in the receive completion. That is something that > will have to be addressed in a "transport independent" way or > dropped as part of the service. Good point. I will augment the Spec accordingly. > > The bottom line here is that it is NOT transport independent. The implementation is not transport independent. But the API allows writing a Transport-specific ULP with full performance, as well as a Transport-independent ULP with better performance than without the proposed API and with "minimal" performance penalty for Transports that provide it. > > Now, the atomicity argument between write and send has some > credibility. > If an application chooses to "adapt" to an explicit > write/send semantic for write completion notification in > environments that can't provide it natively, this could be > addressed by a generalized combined request API that can > guarantee thread-based atomicity to the send queue.
This > seems much more straightforward to me since, in essence, to > adapt to non-native immediate data services, they would have > to allocate resources and behave in virtually the same way as > if they did write/send explicitly. > > It is obvious that the proposed service is not one of > immediate data in the sense defined by InfiniBand. Since > true immediate data is a transport specific speed path > service, it needs to be implemented as a transport specific > extension. To allow an application to initiate multiple > request sequences that must be queued sequentially to > explicitly create a write completion notification or any > other order-based sequence, a generalized combined request > API should be defined. No disagreement here. We have been debating a generic way to combine multiple DTOs into a single call for some time. But how to define a generic way to do it, and to have a single completion on both ends of the connection in the successful case, was always a problem. > > > > >Arkady Kanevsky email: arkady at netapp.com > >Network Appliance Inc. phone: 781-768-5395 > >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > >Waltham, MA 02451 central phone: 781-768-5300 > > > > > From hugh at veritas.com Mon Feb 6 14:29:59 2006 From: hugh at veritas.com (Hugh Dickins) Date: Mon, 6 Feb 2006 22:29:59 +0000 (GMT) Subject: [openib-general] Re: [git patch review 2/2] IB: Don't doublefree pages from scatterlist In-Reply-To: <1139070837112-3fe13a3288c20f5c@cisco.com> References: <1139070837112-3fe13a3288c20f5c@cisco.com> Message-ID: On Sat, 4 Feb 2006, Roland Dreier wrote: > On some architectures, mapping the scatterlist may coalesce entries: > if that coalesced list is then used for freeing the pages afterwards, > there's a danger that pages may be doubly freed (and others leaked). > > Fix Infiniband's __ib_umem_release by freeing from a separate array > beyond the scatterlist: IB_UMEM_MAX_PAGE_CHUNK lowered to fit one page.
It's now looking like this change won't be needed after all: Andi has just posted a patch in the "ipr" thread which should stop x86_64 from interfering with the scatterlist *page,offset,length fields, so what IB and others were doing should then work safely (current thinking is that x86_64 is the only architecture which coalesced in that way). Hugh From sean.hefty at intel.com Mon Feb 6 14:35:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 6 Feb 2006 14:35:21 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal In-Reply-To: Message-ID: >But the remote side does posts Recv. Since it anticipate that >this Recv will be matched against the RDMA Write with immediate >it posts the recv buffer which fits. Yes, there is an issue >for Transport-independent ULP that it does needs a buffer. >For IB it is possible to post 0-size buffer. But if this is the case >Recv end Consumer DOES know that it will be macthed against RDMA >Write so ULP DOES know what it will be matched against. >So in the worst case Consumer does have to pay the price of creating >LMR to handle 4 byte buffer to match RDMA Write Immediate data. How does the remote ULP know this? A DAPL implementation has no idea what a receive will match up against. You're pushing a requirement on the ordering of sends/writes to the application that was not there before. >> It just dawned on me that the immediate data must be in >> registered memory to be sent in a message. This means the >> API must be amended to pass an LMR or, even worse, the >> provider would have to register memory in the speed path or >> create and manipulate its own queue of "immediate" >> data buffers/LMRs. Of course, LMRs are not needed and an >> overhead for transports that provide true immediate data. > >No registration on the speed path. It is Consumer responsibility >to provide Recv Buffer of the right size. >Yes for IB only ULP this can be avoided. 
>But ULP can be written to the proposed API to take full >advantage of IB performance but that code will not be transport >independent. The immediate data needs to be registered before being sent. This will need to be hidden from the user. >But this API allows to write transport independent code >albeit with certain price attached. What good does it do to have "transport independent" code, when the feature being invoked is "transport dependent"? There's no requirement that immediate data be supported. Why define an API so that it can be emulated? Define the right API, and let transports that don't support immediate data indicate so. A "transport independent" application can check this bit and take whatever action is necessary. They need to do so anyway, since the bit may or may not be set. - Sean From roy.k.larsen at intel.com Mon Feb 6 15:49:48 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Mon, 6 Feb 2006 15:49:48 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E0861342B@orsmsx408> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >Sent: Monday, February 06, 2006 2:27 PM > >Roy, >comments inline. > Mine too.... >> >> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] >> >Roy, >> >Can you explain, please? >> > >> >For IB the operation will be layered properly on Transport primitive. >> >And on Recv side it will indicate in completion event DTO that it >> >matches RDMA Write with Immediate and that Immediate Data is >> in event. >> > >> >For iWARP I expect initially, it will be layered on RDMA >> Write followed >> >by Send. The Provider can do post more efficiently than Consumer and >> >guarantee atomicity. >> >On Recv side Consumer will get Recv DTO completion in event and >> >Immediate Data inline as specified by Provider Attribute. 
>> > >> >From the performance point of view Consumers who program to IB only >> >will have no performance degradation at all. But this API >> also allows >> >Consumers to write ULP to be transport independent with minimal >> >penalty: one binary comparison and extra 4 bytes in recv buffer. >> >> If the application could be written transport independently, >> I would have no objection at all. Instead, it must be >> written in a transport-adaptive way and to be able to adapt >> to all possible implementations, the application could not >> send arbitrary "immediate"-sized data as messages because >> there is no way to distinguish between them on the receiving >> side. That is HUGE! It is my experience that send/receive >> is generally used for small messages and to take away >> particular message sizes or to depend on the so the >> application can "adapt" to whatever the immediate size is for >> a particular transport, if even needed, is a very weak >> facility to offer. > >But the remote side does posts Recv. Since it anticipate that >this Recv will be matched against the RDMA Write with immediate >it posts the recv buffer which fits. Yes, there is an issue >for Transport-independent ULP that it does needs a buffer. >For IB it is possible to post 0-size buffer. But if this is the case >Recv end Consumer DOES know that it will be macthed against RDMA >Write so ULP DOES know what it will be matched against. >So in the worst case Consumer does have to pay the price of creating >LMR to handle 4 byte buffer to match RDMA Write Immediate data. I think you missed my larger point. The point was that the application must be written in such a way that it could be inferred when immediate data arrived, for a variety of immediate data sizes, and that places a constraint on the application with respect to the data it may want to send/receive normally.
Whereas, if the application embraced the fact that it was responsible for sending a message to indicate a write completion, it would be free to send whatever amount of data best met its needs. Transports that support true immediate data do not require the ULP to perform buffer matching. They can post a series of receive buffers that may or may not indicate immediate data. The ULP does not have to know ahead of time when immediate data will arrive **against other data receives**. The fact that an IB-oriented application never needs to back receive requests with a buffer if they are only used to indicate immediate data is orthogonal. > >> >> It also affects interface resource allocation. Send queue >> sizes will have to adapt to possibly twice there size. >> > >That is correct. We argued about it at the meeting. >One alternative is to have EP and EVD attr. But this will not >be efficient since it will double the queue size where >a smaller increment is possible due to the depth of the RDMA Write >pipeline outstanding. > >> It just dawned on me that the immediate data must be in >> registered memory to be sent in a message. This means the >> API must be amended to pass an LMR or, even worse, the >> provider would have to register memory in the speed path or >> create and manipulate its own queue of "immediate" >> data buffers/LMRs. Of course, LMRs are not needed and an >> overhead for transports that provide true immediate data. > >No registration on the speed path. It is Consumer responsibility >to provide Recv Buffer of the right size. >Yes for IB only ULP this can be avoided. >But ULP can be written to the proposed API to take full >advantage of IB performance but that code will not be transport >independent. I was referring to the sending side. Source data of a message send must be from registered memory.
For transports that will emulate this service with a write/send sequence, user-specified immediate data will need to be copied to a provider-managed pool of "immediate" data buffers/LMRs, or the interface changed to specify an LMR. > >But this API allows to write transport independent code >albeit with certain price attached. > >> >> Oh, and another thing. InfiniBand indicates the size of the >> RDMA write in the receive completion. That is something that >> will have to be addressed in a "transport independent" way or >> dropped as part of the service. > >Good point. I will augment Spec accordingly. > >> >> The bottom line here is that it is NOT transport independent. > >implementation is not transport independent. >But API allows to write Transport-specific ULP with full perfromance >as well Transport-independent ULP with better performance >than without proposed API and with "minimal" performance >penalty for Transports that provide it. Of course, you can make the application as transport-service adaptive as you want, but that is a weak argument and a slippery slope. My point is that the operational semantics of non-native immediate data transports are identical to write/send in all respects. So, embrace this and just give the ULP a simple interface that has broader applicability for all transports. Provide a thread-atomic combined request capability which can be used for write completion notification (if not natively supported) or any other purpose an application may fancy. > >> >> Now, the atomicity argument between write and send has some >> credibility. >> If an application chooses to "adapt" to an explicit >> write/send semantic for write completion notification in >> environments that can't provide it natively, this could be >> addressed by a generalized combined request API that can >> guarantee thread-based atomicity to the send queue.
This >> seems much more straightforward to me since, in essence, to >> adapt to non-native immediate data services, they would have >> to allocate resources and behave in virtually the same way as >> if they did write/send explicitly. >> >> It is obvious that the proposed service is not one of >> immediate data in the sense defined by InfiniBand. Since >> true immediate data is a transport specific speed path >> service, it needs to be implemented as a transport specific >> extension. To allow an application to initiate multiple >> request sequences that must be queued sequentially to >> explicitly create a write completion notification or any >> other order-based sequence, a generalized combined request >> API should be defined. > > >No disagreemnt here. We were debating a generic way to combine >multiple DTOs into a single call for some time. >But how to define a generic way to do it and to have a single completion >on both ends of the connection in successful case was always a problem. I would think an array of pointers to standard work requests, plus a count, would do it. And of course, each work request can control whether it solicits a completion, so a write/send sequence can generate a single completion event on both ends. Use the EVD lock to guard against other threads injecting requests on the queue during a combined request operation, and the ULP has everything it needs. Roy > >> >> > >> >Arkady Kanevsky email: arkady at netapp.com >> >Network Appliance Inc. phone: 781-768-5395 >> >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >> >Waltham, MA 02451 central phone: 781-768-5300 >> > >> > >> From sean.hefty at intel.com Mon Feb 6 16:00:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 6 Feb 2006 16:00:32 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal In-Reply-To: <468F3FDA28AA87429AD807992E22D07E0861342B@orsmsx408> Message-ID: >I would think an array of pointers and a count to standard work requests >would do it.
>And of course, each work request can control whether it
>solicits a completion so a write/send sequence can generate a single
>completion event on both ends. Use the EVD lock to guard against other
>threads injecting requests on the queue during a combined request
>operation and the ULP has everything it needs.

This is what the OpenIB stack does today. The difference is that it uses a
linked list, rather than an array.

- Sean

From Arkady.Kanevsky at netapp.com Mon Feb 6 16:49:47 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Mon, 6 Feb 2006 19:49:47 -0500
Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediatedataproposal
Message-ID:

I am not clear what you are proposing?
A transport specific API?

The current proposal provides on sending side:
single post, and single completion in the error free case.
This is commonality that simplifies ULP.

Arkady

Arkady Kanevsky email: arkady at netapp.com
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451 central phone: 781-768-5300

> -----Original Message-----
> From: Larsen, Roy K [mailto:roy.k.larsen at intel.com]
> Sent: Monday, February 06, 2006 6:50 PM
> To: Kanevsky, Arkady; Caitlin Bestler;
> dat-discussions at yahoogroups.com; Sean Hefty
> Cc: openib-general at openib.org
> Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0
> immediatedataproposal
>
>
>
> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> >Sent: Monday, February 06, 2006 2:27 PM
> >
> >Roy,
> >comments inline.
> >
>
> Mine too....
>
> >>
> >> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> >> >Roy,
> >> >Can you explain, please?
> >> >
> >> >For IB the operation will be layered properly on Transport primitive.
> >> >And on Recv side it will indicate in completion event DTO that it
> >> >matches RDMA Write with Immediate and that Immediate Data is in event.
> >> >
> >> >For iWARP I expect initially, it will be layered on RDMA Write followed
> >> >by Send. The Provider can do post more efficiently than Consumer and
> >> >guarantee atomicity.
> >> >On Recv side Consumer will get Recv DTO completion in event and
> >> >Immediate Data inline as specified by Provider Attribute.
> >> >
> >> >From the performance point of view Consumers who program to IB only
> >> >will have no performance degradation at all. But this API also allows
> >> >Consumers to write ULP to be transport independent with minimal
> >> >penalty: one binary comparison and extra 4 bytes in recv buffer.
> >>
> >> If the application could be written transport independently, I would
> >> have no objection at all. Instead, it must be written in a
> >> transport-adaptive way and to be able to adapt to all possible
> >> implementations, the application could not send arbitrary
> >> "immediate"-sized data as messages because there is no way to
> >> distinguish between them on the receiving side. That is HUGE! It is
> >> my experience that send/receive is generally used for small messages
> >> and to take away particular message sizes or to depend on the so the
> >> application can "adapt" to whatever the immediate size is for a
> >> particular transport, if even needed, is a very weak facility to
> >> offer.
> >
> >But the remote side does post Recv. Since it anticipates that this Recv
> >will be matched against the RDMA Write with immediate it posts the recv
> >buffer which fits. Yes, there is an issue for Transport-independent ULP
> >that it does need a buffer.
> >For IB it is possible to post 0-size buffer. But if this is the case
> >Recv end Consumer DOES know that it will be matched against RDMA Write
> >so ULP DOES know what it will be matched against.
> >So in the worst case Consumer does have to pay the price of creating
> >LMR to handle 4 byte buffer to match RDMA Write Immediate data.
>
> I think you missed my larger point. The point was that the
> application must be written in such a way that it could
> infer when immediate data arrived for a variety of
> immediate data sizes and that places a constraint on the
> application wrt data it may want to send/receive normally.
> Whereas, if the application embraced the fact that it was
> responsible for sending a message to indicate a write
> completion, it is free to send whatever amount of data best
> met its needs.
>
> Transports that support true immediate data do not require
> the ULP to perform buffer matching. They can post a series
> of receive buffers that may or may not indicate immediate
> data. The ULP does not have to know ahead of time when
> immediate data will arrive **against other data receives**.
> The fact that an IB oriented application never needs to back
> a receive request with a buffer if they were only used to
> indicate immediate data is orthogonal.
>
> >
> >>
> >> It also affects interface resource allocation. Send queue sizes will
> >> have to adapt to possibly twice their size.
> >>
> >
> >That is correct. We argued about it at the meeting.
> >One alternative is to have EP and EVD attr. But this will not be
> >efficient since it will double the queue size where a smaller increment
> >is possible due to the depth of the RDMA Write pipeline outstanding.
> >
> >> It just dawned on me that the immediate data must be in registered
> >> memory to be sent in a message. This means the API must be amended
> >> to pass an LMR or, even worse, the provider would have to register
> >> memory in the speed path or create and manipulate its own queue of
> >> "immediate" data buffers/LMRs. Of course, LMRs are not needed and an
> >> overhead for transports that provide true immediate data.
> >
> >No registration on the speed path. It is Consumer responsibility to
> >provide Recv Buffer of the right size.
> >Yes for IB only ULP this can be avoided.
> >But ULP can be written to the proposed API to take full advantage of IB
> >performance but that code will not be transport independent.
>
> I was referring to the sending side. Source data of a
> message send must be from registered memory. For transports
> that will emulate this service with a write/send sequence,
> user specified immediate data will need to be copied to a
> provider managed pool of "immediate" data buffers/LMRs or the
> interface changed to specify an LMR.
>
> >
> >But this API allows to write transport independent code albeit with
> >certain price attached.
> >
> >>
> >> Oh, and another thing. InfiniBand indicates the size of the RDMA
> >> write in the receive completion. That is something that will have to
> >> be addressed in a "transport independent" way or dropped as part of
> >> the service.
> >
> >Good point. I will augment Spec accordingly.
> >
> >>
> >> The bottom line here is that it is NOT transport independent.
> >
> >implementation is not transport independent.
> >But API allows to write Transport-specific ULP with full performance as
> >well as Transport-independent ULP with better performance than without
> >proposed API and with "minimal" performance penalty for Transports that
> >provide it.
>
> Of course, you can make the application as transport service
> adaptive as you want but that is a weak argument and a
> slippery slope. My point is that the operational semantics of
> non-native immediate data transports are identical to
> write/send in all respects. So, embrace this and just give
> the ULP a simple interface that has broader applicability for
> all transports. Provide a thread atomic combined request
> capability which can be used for write completion
> notification (if not natively supported) or any other purpose
> an application may fancy.
>
> >
> >>
> >> Now, the atomicity argument between write and send has some
> >> credibility.
> >> If an application chooses to "adapt" to an explicit write/send
> >> semantic for write completion notification in environments that can't
> >> provide it natively, this could be addressed by a generalized
> >> combined request API that can guarantee thread-based atomicity to the
> >> send queue. This seems much more straightforward to me since, in
> >> essence, to adapt to non-native immediate data services, they would
> >> have to allocate resources and behave in virtually the same way as if
> >> they did write/send explicitly.
> >>
> >> It is obvious that the proposed service is not one of immediate data
> >> in the sense defined by InfiniBand. Since true immediate data is a
> >> transport specific speed path service, it needs to be implemented as
> >> a transport specific extension. To allow an application to initiate
> >> multiple request sequences that must be queued sequentially to
> >> explicitly create a write completion notification or any other
> >> order-based sequence, a generalized combined request API should be
> >> defined.
> >
> >
> >No disagreement here. We were debating a generic way to combine multiple
> >DTOs into a single call for some time.
> >But how to define a generic way to do it and to have a single completion
> >on both ends of the connection in successful case was always a problem.
>
> I would think an array of pointers and a count to standard
> work requests would do it. And of course, each work request
> can control whether it solicits a completion so a write/send
> sequence can generate a single completion event on both ends.
> Use the EVD lock to guard against other threads injecting
> requests on the queue during a combined request operation and
> the ULP has everything it needs.
>
> Roy
>
> >
> >>
> >> >
> >> >Arkady Kanevsky email: arkady at netapp.com
> >> >Network Appliance Inc. phone: 781-768-5395
> >> >1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195
> >> >Waltham, MA 02451 central phone: 781-768-5300
> >> >
> >> >
> >>
>

From rdreier at cisco.com Mon Feb 6 17:27:56 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 06 Feb 2006 17:27:56 -0800
Subject: [openib-general] questions about gen2 srp driver
In-Reply-To: <200602062048.k16KmZfH025495@cmf.nrl.navy.mil> (chas williams's message of "Mon, 6 Feb 2006 15:48:35 -0500")
References: <200602062048.k16KmZfH025495@cmf.nrl.navy.mil>
Message-ID:

    chas> it seems to take scsi_host->host_lock with a spin_lock_irq()
    chas> inside a couple of work queues. i believe work queues run
    chas> at process context and not interrupt context. therefore,
    chas> one should probably use spin_lock_irqsave()?

Yes, it's exactly because we know that work queues run in process
context with interrupts enabled which lets us use spin_lock_irq.

    chas> if there is only a single set of rdma keys how can the
    chas> driver support more than one command (particularly on a
    chas> target with multiple lun's) outstanding command? i didn't
    chas> think the srp_post_send() was synchronous with respect to the
    chas> completion of the current rdma request?

There's no limitation on number of outstanding RDMAs targeting a
single R_Key.

 - R.

From shubbell at qube3.dbresearch.net Mon Feb 6 18:46:37 2006
From: shubbell at qube3.dbresearch.net (Sean Hubbell)
Date: Mon, 06 Feb 2006 20:46:37 -0600
Subject: [openib-general] relocation error / link time reference error
Message-ID: <200602070246.k172kb325484@qube3.dbresearch.net>

---------- Original message ----------
Date: 05 Feb 2006 11:44:52 -0500
From: Hal Rosenstock
Reply-To: Hal Rosenstock
To: Sean Hubbell
Subject: Re: [openib-general] relocation error / link time reference error

On Sun, 2006-02-05 at 09:40, Sean Hubbell wrote:
> Hal,
>
> I removed and rebuilt everything.

And everything's OK now ?

-- Hal

Nope, I still have the link time reference problem. I'll download the latest
svn tree again in the morning and rebuild.
How do you typically download and rebuild? Here are the steps that I follow:

1) Download the openib code.
2) Copy a version of the Kernel Source Tree and copy over the infiniband
   directory to the drivers dir.
3) Remove the include/rdma directory and all of the .svn directories.
4) Get a second version of the Kernel Source Tree and build a patch file for
   the infiniband changes.
5) I add the patch file to the linux-2.6.15.spec file.
6) I rebuild the kernel (rpm based kernel) and then install the rpms (smp,
   numa, ...).
7) I reboot.
8) I then remove all of the openib modules.
9) I then rebuild openib tools from the commands listed on the wiki FAQ.
10) That's it ...

How do you rebuild openib? Do you pull from a particular tag or the trunk?

If anyone has a better way to build the kernel, please let me know. I only
want to make sure that I can build it as an rpm because I like the ability to
figure out what file goes with what package.

Any and all suggestions would be appreciated.

Thanks again for all of the help Hal.

Sean Hubbell

From sean.hefty at intel.com Mon Feb 6 19:31:15 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 6 Feb 2006 19:31:15 -0800
Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal
In-Reply-To:
Message-ID:

>I am not clear what you are proposing?
>A transport specific API?
>
>The current proposal provides on sending side:
>single post, and single completion in the error free case.
>This is commonality that simplifies ULP.

App 1 - transport aware:
if (transport == IB)
    Do something
else
    Do something different

App 2 - transport independent:
if (immediate data flag set)
    if (DTO == 1)
        Do something
    else
        do something else
else
    do something different

All you've done is add flags in order to call the API "transport neutral". The
result to the application is the same, except that the interface is more complex
than it needs to be, and causes confusion on the receiving side.
And on the sending side, the application still needs to check the flag to see
if immediate data is supported. A true transport neutral API wouldn't need
flags that specify the actual differences between the transports.

The requirement is to provide an API that supports RDMA writes with immediate
data. A send that follows an RDMA write is not immediate data, and the API
should not be constructed around trying to make it so.

If you want to add a new requirement to the API to support posting multiple
work requests with a single call, that is a different requirement.

- Sean

From rdreier at cisco.com Mon Feb 6 17:44:14 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 06 Feb 2006 17:44:14 -0800
Subject: [openib-general] Re: [git patch review 2/2] IB: Don't doublefree pages from scatterlist
In-Reply-To: (Hugh Dickins's message of "Mon, 6 Feb 2006 22:29:59 +0000 (GMT)")
References: <1139070837112-3fe13a3288c20f5c@cisco.com>
Message-ID:

    Hugh> It's now looking like this change won't be needed after all:
    Hugh> Andi has just posted a patch in the "ipr" thread which
    Hugh> should stop x86_64 from interfering with the scatterlist
    Hugh> *page,offset,length fields, so what IB and others were doing
    Hugh> should then work safely (current thinking is that x86_64 is
    Hugh> the only architecture which coalesced in that way).

OK, I'll drop this from my tree.

 - R.

From sean.hefty at intel.com Mon Feb 6 21:16:07 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 6 Feb 2006 21:16:07 -0800
Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
In-Reply-To:
Message-ID:

>The requirement is to provide an API that supports RDMA writes with immediate
>data. A send that follows an RDMA write is not immediate data, and the API
>should not be constructed around trying to make it so.

To be clear, I believe that write with immediate should be part of the normal
APIs, rather than an extension, but should be designed around those devices
that provide it natively.
- Sean From jackm at mellanox.co.il Mon Feb 6 23:39:33 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 7 Feb 2006 09:39:33 +0200 Subject: [openib-general] [PATCH 1 of 3] mad: large RMPP support Message-ID: <20060207073933.GA14165@mellanox.co.il> patch 1 of 3 --- Large RMPP support: changes/additions to underlying data structures and prototypes. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/include/rdma/ib_mad.h =================================================================== --- last_stable.orig/drivers/infiniband/include/rdma/ib_mad.h +++ last_stable/drivers/infiniband/include/rdma/ib_mad.h @@ -141,6 +141,12 @@ struct ib_rmpp_hdr { __be32 paylen_newwin; }; +struct ib_mad_multipacket_seg { + struct list_head list; + u32 size; + u8 data[0]; +}; + typedef u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) @@ -220,7 +226,9 @@ struct ib_class_port_info */ struct ib_mad_send_buf { struct ib_mad_send_buf *next; - void *mad; + void *mad; /* RMPP: first segment, + including the MAD header */ + void *mad_payload; /* RMPP: changed per segment */ struct ib_mad_agent *mad_agent; struct ib_ah *ah; void *context[2]; @@ -485,17 +493,6 @@ int ib_unregister_mad_agent(struct ib_ma int ib_post_send_mad(struct ib_mad_send_buf *send_buf, struct ib_mad_send_buf **bad_send_buf); -/** - * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. - * @mad_recv_wc: Work completion information for a received MAD. - * @buf: User-provided data buffer to receive the coalesced buffers. The - * referenced buffer should be at least the size of the mad_len specified - * by @mad_recv_wc. - * - * This call copies a chain of received MAD segments into a single data buffer, - * removing duplicated headers. - */ -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf); /** * ib_free_recv_mad - Returns data buffers used to receive a MAD. 
@@ -601,6 +598,16 @@ struct ib_mad_send_buf * ib_create_send_ gfp_t gfp_mask); /** + * ib_get_multipacket_seg - returns a segment of an RMPP multipacket mad send + * @send_buf: Previously allocated send data buffer. + * @seg_num: number of the segment to return. + * + * This routine returns a pointer to a segment of a multipacket RMPP message. + */ +struct ib_mad_multipacket_seg *ib_get_multipacket_seg(struct ib_mad_send_buf * + send_buf, int seg_num); + +/** * ib_free_send_mad - Returns data buffers used to send a MAD. * @send_buf: Previously allocated send data buffer. */ Index: last_stable/drivers/infiniband/core/mad_priv.h =================================================================== --- last_stable.orig/drivers/infiniband/core/mad_priv.h +++ last_stable/drivers/infiniband/core/mad_priv.h @@ -119,7 +119,8 @@ struct ib_mad_send_wr_private { struct list_head agent_list; struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_buf send_buf; - DECLARE_PCI_UNMAP_ADDR(mapping) + DECLARE_PCI_UNMAP_ADDR(header_mapping) + DECLARE_PCI_UNMAP_ADDR(payload_mapping) struct ib_send_wr send_wr; struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; __be64 tid; @@ -130,9 +131,11 @@ struct ib_mad_send_wr_private { enum ib_wc_status status; /* RMPP control */ + struct list_head multipacket_list; int last_ack; int seg_num; int newwin; + int total_length; int total_seg; int data_offset; int pad; Index: last_stable/drivers/infiniband/core/user_mad.c =================================================================== --- last_stable.orig/drivers/infiniband/core/user_mad.c +++ last_stable/drivers/infiniband/core/user_mad.c @@ -123,6 +123,7 @@ struct ib_umad_packet { struct ib_mad_send_buf *msg; struct list_head list; int length; + struct list_head seg_list; struct ib_user_mad mad; }; From jackm at mellanox.co.il Mon Feb 6 23:40:36 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 7 Feb 2006 09:40:36 +0200 Subject: [openib-general] [PATCH 2 of 3] mad: large RMPP 
support Message-ID: <20060207074036.GB14165@mellanox.co.il> patch 2 of 3 --- Large RMPP support, receive side: copy the arriving MADs to chunks instead of coalescing to one large buffer in kernel space. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/core/mad_rmpp.c =================================================================== --- last_stable.orig/drivers/infiniband/core/mad_rmpp.c +++ last_stable/drivers/infiniband/core/mad_rmpp.c @@ -433,44 +433,6 @@ static struct ib_mad_recv_wc * complete_ return rmpp_wc; } -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf) -{ - struct ib_mad_recv_buf *seg_buf; - struct ib_rmpp_mad *rmpp_mad; - void *data; - int size, len, offset; - u8 flags; - - len = mad_recv_wc->mad_len; - if (len <= sizeof(struct ib_mad)) { - memcpy(buf, mad_recv_wc->recv_buf.mad, len); - return; - } - - offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); - - list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { - rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; - flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); - - if (flags & IB_MGMT_RMPP_FLAG_FIRST) { - data = rmpp_mad; - size = sizeof(*rmpp_mad); - } else { - data = (void *) rmpp_mad + offset; - if (flags & IB_MGMT_RMPP_FLAG_LAST) - size = len; - else - size = sizeof(*rmpp_mad) - offset; - } - - memcpy(buf, data, size); - len -= size; - buf += size; - } -} -EXPORT_SYMBOL(ib_coalesce_recv_mad); - static struct ib_mad_recv_wc * continue_rmpp(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) Index: last_stable/drivers/infiniband/core/user_mad.c =================================================================== --- last_stable.orig/drivers/infiniband/core/user_mad.c +++ last_stable/drivers/infiniband/core/user_mad.c @@ -176,6 +177,88 @@ static int queue_packet(struct ib_umad_f return ret; } +static int data_offset(u8 mgmt_class) +{ + if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) + 
return IB_MGMT_SA_HDR; + else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && + (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return IB_MGMT_VENDOR_HDR; + else + return IB_MGMT_RMPP_HDR; +} + +static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + struct ib_umad_packet *packet) +{ + struct ib_mad_recv_buf *seg_buf; + struct ib_rmpp_mad *rmpp_mad; + void *data; + struct ib_mad_multipacket_seg *seg; + int size, len, offset; + u8 flags; + + len = mad_recv_wc->mad_len; + if (len <= sizeof(struct ib_mad)) { + memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); + return 0; + } + + /* Multipacket (RMPP) MAD */ + offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); + + list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { + rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; + flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); + + if (flags & IB_MGMT_RMPP_FLAG_FIRST) { + size = sizeof(*rmpp_mad); + memcpy(&packet->mad.data, rmpp_mad, size); + } else { + data = (void *) rmpp_mad + offset; + if (flags & IB_MGMT_RMPP_FLAG_LAST) + size = len; + else + size = sizeof(*rmpp_mad) - offset; + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + sizeof(struct ib_rmpp_mad) - offset, + GFP_KERNEL); + if (!seg) + return -ENOMEM; + memcpy(seg->data, data, size); + list_add_tail(&seg->list, &packet->seg_list); + } + len -= size; + } + return 0; +} + +static struct ib_umad_packet *alloc_packet(void) +{ + struct ib_umad_packet *packet; + int length = sizeof *packet + sizeof(struct ib_mad); + + packet = kzalloc(length, GFP_KERNEL); + if (!packet) { + printk(KERN_ERR "alloc_packet: mem alloc failed for length %d\n", + length); + return NULL; + } + INIT_LIST_HEAD(&packet->seg_list); + return packet; +} + +static void free_packet(struct ib_umad_packet *packet) +{ + struct ib_mad_multipacket_seg *seg, *tmp; + + list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { + list_del(&seg->list); + kfree(seg); + } + kfree(packet); +} + static void 
send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { @@ -243,13 +298,16 @@ static void recv_handler(struct ib_mad_a goto out; length = mad_recv_wc->mad_len; - packet = alloc_packet(length); + packet = alloc_packet(); if (!packet) goto out; packet->length = length; - ib_coalesce_recv_mad(mad_recv_wc, packet->mad.data); + if (copy_recv_mad(mad_recv_wc, packet)) { + free_packet(packet); + goto out; + } packet->mad.hdr.status = 0; packet->mad.hdr.length = length + sizeof (struct ib_user_mad); @@ -278,6 +336,7 @@ static ssize_t ib_umad_read(struct file size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; + struct ib_mad_multipacket_seg *seg; struct ib_umad_packet *packet; ssize_t ret; @@ -304,18 +363,42 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); - if (count < packet->length + sizeof (struct ib_user_mad)) { - /* Return length needed (and first RMPP segment) if too small */ - if (copy_to_user(buf, &packet->mad, - sizeof (struct ib_user_mad) + sizeof (struct ib_mad))) - ret = -EFAULT; - else - ret = -ENOSPC; - } else if (copy_to_user(buf, &packet->mad, - packet->length + sizeof (struct ib_user_mad))) + if (copy_to_user(buf, &packet->mad, + sizeof(struct ib_user_mad) + sizeof(struct ib_mad))) { ret = -EFAULT; - else + goto err; + } + + if (count < packet->length + sizeof (struct ib_user_mad)) + /* User buffer too small. Return first RMPP segment (which + * includes RMPP message length). + */ + ret = -ENOSPC; + else if (packet->length <= sizeof(struct ib_mad)) + ret = packet->length + sizeof(struct ib_user_mad); + else { + int len = packet->length - sizeof(struct ib_mad); + struct ib_rmpp_mad *rmpp_mad = + (struct ib_rmpp_mad *) packet->mad.data; + int max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + int seg_payload; + /* multipacket RMPP MAD message. Copy remainder of message. + * Note that last segment may have a shorter payload. 
+ */ + buf += sizeof(struct ib_user_mad) + sizeof(struct ib_mad); + list_for_each_entry(seg, &packet->seg_list, list) { + seg_payload = min_t(int, len, max_seg_payload); + if (copy_to_user(buf, seg->data, seg_payload)) { + ret = -EFAULT; + goto err; + } + buf += seg_payload; + len -= seg_payload; + } ret = packet->length + sizeof (struct ib_user_mad); + } +err: if (ret < 0) { /* Requeue packet */ spin_lock_irq(&file->recv_lock); From jackm at mellanox.co.il Mon Feb 6 23:41:33 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 7 Feb 2006 09:41:33 +0200 Subject: [openib-general] [PATCH 3 of 3] mad: large RMPP support Message-ID: <20060207074133.GC14165@mellanox.co.il> patch 3 of 3 --- Large RMPP support, send side: split a multipacket MAD buffer to a list of segments, (multipacket_list) and send these using an gather list of size 2. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/core/mad_rmpp.c =================================================================== --- last_stable.orig/drivers/infiniband/core/mad_rmpp.c +++ last_stable/drivers/infiniband/core/mad_rmpp.c @@ -570,16 +532,23 @@ start_rmpp(struct ib_mad_agent_private * return mad_recv_wc; } -static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) +static inline void *get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) { - return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + - (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * - (mad_send_wr->seg_num - 1); + struct ib_mad_multipacket_seg *seg; + int i = 2; + + list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { + if (i == mad_send_wr->seg_num) + return seg->data; + i++; + } + return NULL; } -static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) +int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; + void *next_data; int timeout; u32 paylen; @@ -592,14 +561,14 @@ static int 
send_next_seg(struct ib_mad_s paylen = mad_send_wr->total_seg * IB_MGMT_RMPP_DATA - mad_send_wr->pad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(paylen); - mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); } else { - mad_send_wr->send_wr.num_sge = 2; - mad_send_wr->sg_list[0].length = mad_send_wr->data_offset; - mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr); - mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - - mad_send_wr->data_offset; - mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + next_data = get_seg_addr(mad_send_wr); + if (!next_data) { + printk(KERN_ERR PFX "send_next_seg: " + "could not find next segment\n"); + return -EINVAL; + } + mad_send_wr->send_buf.mad_payload = next_data; rmpp_mad->rmpp_hdr.paylen_newwin = 0; } @@ -838,7 +807,7 @@ out: int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; - int i, total_len, ret; + int ret; rmpp_mad = mad_send_wr->send_buf.mad; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & @@ -848,20 +817,16 @@ int ib_send_rmpp_mad(struct ib_mad_send_ if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) return IB_RMPP_RESULT_INTERNAL; - if (mad_send_wr->send_wr.num_sge > 1) - return -EINVAL; /* TODO: support num_sge > 1 */ + if (mad_send_wr->send_wr.num_sge != 2) + return -EINVAL; mad_send_wr->seg_num = 1; mad_send_wr->newwin = 1; mad_send_wr->data_offset = data_offset(rmpp_mad->mad_hdr.mgmt_class); - total_len = 0; - for (i = 0; i < mad_send_wr->send_wr.num_sge; i++) - total_len += mad_send_wr->send_wr.sg_list[i].length; - - mad_send_wr->total_seg = (total_len - mad_send_wr->data_offset) / + mad_send_wr->total_seg = (mad_send_wr->total_length - mad_send_wr->data_offset) / (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); - mad_send_wr->pad = total_len - IB_MGMT_RMPP_HDR - + mad_send_wr->pad = mad_send_wr->total_length - IB_MGMT_RMPP_HDR - be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); /* We need to wait for the final ACK 
even if there isn't a response */ Index: last_stable/drivers/infiniband/core/mad.c =================================================================== --- last_stable.orig/drivers/infiniband/core/mad.c +++ last_stable/drivers/infiniband/core/mad.c @@ -779,6 +779,17 @@ static int get_buf_length(int hdr_len, i return hdr_len + data_len + pad; } +static void free_send_multipacket_list(struct ib_mad_send_wr_private * + mad_send_wr) +{ + struct ib_mad_multipacket_seg *s, *t; + + list_for_each_entry_safe(s, t, &mad_send_wr->multipacket_list, list) { + list_del(&s->list); + kfree(s); + } +} + struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, int rmpp_active, @@ -787,39 +798,38 @@ struct ib_mad_send_buf * ib_create_send_ { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; - int length, buf_size; + int length, message_size, seg_size; void *buf; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - buf_size = get_buf_length(hdr_len, data_len); + message_size = get_buf_length(hdr_len, data_len); if ((!mad_agent->rmpp_version && - (rmpp_active || buf_size > sizeof(struct ib_mad))) || - (!rmpp_active && buf_size > sizeof(struct ib_mad))) + (rmpp_active || message_size > sizeof(struct ib_mad))) || + (!rmpp_active && message_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - length = sizeof *mad_send_wr + buf_size; - if (length >= PAGE_SIZE) - buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - buf = kmalloc(length, gfp_mask); + length = sizeof *mad_send_wr + message_size; + buf = kzalloc(sizeof *mad_send_wr + sizeof(struct ib_mad), gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, length); - - mad_send_wr = buf + buf_size; + mad_send_wr = buf + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_send_wr->multipacket_list); mad_send_wr->send_buf.mad = buf; + mad_send_wr->send_buf.mad_payload = 
buf + hdr_len; mad_send_wr->mad_agent_priv = mad_agent_priv; - mad_send_wr->sg_list[0].length = buf_size; + mad_send_wr->sg_list[0].length = hdr_len; mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey; + mad_send_wr->sg_list[1].length = sizeof(struct ib_mad) - hdr_len; + mad_send_wr->sg_list[1].lkey = mad_agent->mr->lkey; mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr; mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; - mad_send_wr->send_wr.num_sge = 1; + mad_send_wr->send_wr.num_sge = 2; mad_send_wr->send_wr.opcode = IB_WR_SEND; mad_send_wr->send_wr.send_flags = IB_SEND_SIGNALED; mad_send_wr->send_wr.wr.ud.remote_qpn = remote_qpn; @@ -827,6 +837,7 @@ struct ib_mad_send_buf * ib_create_send_ mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; if (rmpp_active) { + struct ib_mad_multipacket_seg *seg; struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - IB_MGMT_RMPP_HDR + data_len); @@ -834,6 +845,27 @@ struct ib_mad_send_buf * ib_create_send_ rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); + mad_send_wr->total_length = message_size; + /* allocate RMPP buffers */ + message_size -= sizeof(struct ib_mad); + seg_size = sizeof(struct ib_mad) - hdr_len; + while (message_size > 0) { + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + seg_size, gfp_mask); + if (!seg) { + printk(KERN_ERR "ib_create_send_mad: RMPP mem " + "alloc failed for len %zd, gfp %#x\n", + sizeof(struct ib_mad_multipacket_seg) + + seg_size, gfp_mask); + free_send_multipacket_list(mad_send_wr); + kfree(buf); + return ERR_PTR(-ENOMEM); + } + seg->size = seg_size; + list_add_tail(&seg->list, + &mad_send_wr->multipacket_list); + message_size -= seg_size; + } } mad_send_wr->send_buf.mad_agent = mad_agent; @@ -842,23 +874,36 @@ struct ib_mad_send_buf * ib_create_send_ } EXPORT_SYMBOL(ib_create_send_mad); +struct ib_mad_multipacket_seg 
*ib_get_multipacket_seg(struct ib_mad_send_buf * + send_buf, int seg_num) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_multipacket_seg *seg; + int i = 2; + + mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, + send_buf); + list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { + if (i == seg_num) + return seg; + i++; + } + return NULL; +} +EXPORT_SYMBOL(ib_get_multipacket_seg); + void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { struct ib_mad_agent_private *mad_agent_priv; - void *mad_send_wr; - int length; + struct ib_mad_send_wr_private *mad_send_wr; mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - length = sizeof(struct ib_mad_send_wr_private) + (mad_send_wr - send_buf->mad); - if (length >= PAGE_SIZE) - free_pages((unsigned long)send_buf->mad, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(send_buf->mad); - + free_send_multipacket_list(mad_send_wr); + kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } @@ -881,10 +926,17 @@ int ib_send_mad(struct ib_mad_send_wr_pr mad_agent = mad_send_wr->send_buf.mad_agent; sge = mad_send_wr->sg_list; - sge->addr = dma_map_single(mad_agent->device->dma_device, - mad_send_wr->send_buf.mad, sge->length, - DMA_TO_DEVICE); - pci_unmap_addr_set(mad_send_wr, mapping, sge->addr); + sge[0].addr = dma_map_single(mad_agent->device->dma_device, + mad_send_wr->send_buf.mad, + sge[0].length, + DMA_TO_DEVICE); + pci_unmap_addr_set(mad_send_wr, header_mapping, sge[0].addr); + + sge[1].addr = dma_map_single(mad_agent->device->dma_device, + mad_send_wr->send_buf.mad_payload, + sge[1].length, + DMA_TO_DEVICE); + pci_unmap_addr_set(mad_send_wr, payload_mapping, sge[1].addr); spin_lock_irqsave(&qp_info->send_queue.lock, flags); if (qp_info->send_queue.count < qp_info->send_queue.max_active) { @@ 
-901,11 +953,15 @@ int ib_send_mad(struct ib_mad_send_wr_pr list_add_tail(&mad_send_wr->mad_list.list, list); } spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); - if (ret) + if (ret) { dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, mapping), - sge->length, DMA_TO_DEVICE); + pci_unmap_addr(mad_send_wr, header_mapping), + sge[0].length, DMA_TO_DEVICE); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, payload_mapping), + sge[1].length, DMA_TO_DEVICE); + } return ret; } @@ -1876,8 +1932,11 @@ static void ib_mad_send_done_handler(str retry: dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, mapping), + pci_unmap_addr(mad_send_wr, header_mapping), mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); + dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, payload_mapping), + mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); queued_send_wr = NULL; spin_lock_irqsave(&send_queue->lock, flags); list_del(&mad_list->list); Index: last_stable/drivers/infiniband/core/user_mad.c =================================================================== --- last_stable.orig/drivers/infiniband/core/user_mad.c +++ last_stable/drivers/infiniband/core/user_mad.c @@ -187,7 +270,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kzalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL); + timeout = alloc_packet(); if (!timeout) goto out; @@ -198,40 +281,12 @@ static void send_handler(struct ib_mad_a sizeof (struct ib_mad_hdr)); if (queue_packet(file, agent, timeout)) - kfree(timeout); + free_packet(timeout); } out: kfree(packet); } -static struct ib_umad_packet *alloc_packet(int buf_size) -{ - struct ib_umad_packet *packet; - int length = sizeof *packet + buf_size; - - if (length >= PAGE_SIZE) - packet = (void *)__get_free_pages(GFP_KERNEL, 
long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - packet = kmalloc(length, GFP_KERNEL); - - if (!packet) - return NULL; - - memset(packet, 0, length); - return packet; -} - -static void free_packet(struct ib_umad_packet *packet) -{ - int length = packet->length + sizeof *packet; - if (length >= PAGE_SIZE) - free_pages((unsigned long) packet, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(packet); -} - - - static void recv_handler(struct ib_mad_agent *agent, struct ib_mad_recv_wc *mad_recv_wc) { @@ -339,6 +422,8 @@ static ssize_t ib_umad_write(struct file __be64 *tid; int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; + int s, seg_num; + struct ib_mad_multipacket_seg *seg; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -415,6 +500,11 @@ static ssize_t ib_umad_write(struct file goto err_ah; } + if (!rmpp_active && length > sizeof(struct ib_mad)) { + ret = -EINVAL; + goto err_ah; + } + packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), 0, rmpp_active, @@ -432,14 +522,32 @@ static ssize_t ib_umad_write(struct file /* Copy MAD headers (RMPP header in place) */ memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); - /* Now, copy rest of message from user into send buffer */ + /* complete copying first 256 bytes of message into send buffer */ if (copy_from_user(packet->msg->mad + copy_offset, buf + sizeof (struct ib_user_mad) + copy_offset, - length - copy_offset)) { + min_t(int, length, sizeof(struct ib_mad)) - copy_offset)) { ret = -EFAULT; goto err_msg; } + /* if RMPP, copy rest of send message from user to multipacket list */ + length -= sizeof(struct ib_mad); + if (length > 0) { + buf += sizeof (struct ib_user_mad) + sizeof(struct ib_mad); + for (seg_num = 2; length > 0; ++seg_num, buf += s, length -= s) { + seg = ib_get_multipacket_seg(packet->msg, seg_num); + BUG_ON(!seg); + s = min_t(int, length, seg->size); + if (copy_from_user(seg->data, buf, s)) 
{ + ret = -EFAULT; + goto err_msg; + } + } + /* Pad last segment with zeroes. */ + if (seg->size - s) + memset(seg->data + s, 0, seg->size - s); + } + /* * If userspace is generating a request that will generate a * response, we need to make sure the high-order part of the From mst at mellanox.co.il Tue Feb 7 00:02:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Feb 2006 10:02:38 +0200 Subject: [openib-general] Re: [git patch review 2/2] IB: Don't doublefree pages from scatterlist In-Reply-To: References: Message-ID: <20060207080238.GT31609@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [git patch review 2/2] IB: Don't doublefree pages from scatterlist > > Hugh> It's now looking like this change won't be needed after all: > Hugh> Andi has just posted a patch in the "ipr" thread which > Hugh> should stop x86_64 from interfering with the scatterlist > Hugh> *page,offset,length fields, so what IB and others were doing > Hugh> should then work safely (current thinking is that x86_64 is > Hugh> the only architecture which coalesced in that way). > > OK, I'll drop this from my tree. > > - R. But hopefully, not (yet) the svn tree - svn tree is explicitly targeting the last stable kernel from kernels.org? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Tue Feb 7 00:16:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 00:16:12 -0800 Subject: [openib-general] Re: [git patch review 2/2] IB: Don't doublefree pages from scatterlist In-Reply-To: <20060207080238.GT31609@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Feb 2006 10:02:38 +0200") References: <20060207080238.GT31609@mellanox.co.il> Message-ID: Michael> But hopefully, not (yet) the svn tree - svn tree is Michael> explicitly targeting the last stable kernel from Michael> kernels.org? Hmm, good point. I actually already reverted it in svn... ...but now I'm not sure it's worth worrying about. 
The only problematic arch (x86_64 with GART IOMMU) will not ever use the IOMMU and coalesce sg lists if the device being mapped for can do 64-bit DMA. So this bug could never have a practical effect for us. - R. From yael at mellanox.co.il Tue Feb 7 00:28:41 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 07 Feb 2006 10:28:41 +0200 Subject: [openib-general] [PATCH] Opensm - change default dir for Windows Message-ID: <5zek2f1sjq.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch includes some fixes for the windows stack: 1. Add needed __cdecl. 2. Change the default directories/files names. Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 5307) +++ include/opensm/osm_base.h (working copy) @@ -50,6 +50,13 @@ #ifndef _OSM_BASE_H_ #define _OSM_BASE_H_ +#ifdef __WIN__ +#include +#define OSM_CDECL __cdecl +#else +#define OSM_CDECL +#endif + #include #ifdef __cplusplus @@ -176,7 +183,7 @@ BEGIN_C_DECLS * SYNOPSIS */ #ifdef __WIN__ -#define OSM_DEFAULT_TMP_DIR "C:\\Windows\\Temp\\" +#define OSM_DEFAULT_TMP_DIR GetOsmPath() #else #define OSM_DEFAULT_TMP_DIR "/tmp/" #endif @@ -188,11 +195,12 @@ BEGIN_C_DECLS * * DESCRIPTION * Specifies the default cache directory for the db files. +* Note that the directory must appear with "/" ("\\" for windows) at the end. * * SYNOPSIS */ #ifdef __WIN__ -#define OSM_DEFAULT_CACHE_DIR "C:\\Windows\\Temp\\" +#define OSM_DEFAULT_CACHE_DIR GetOsmPath() #else #define OSM_DEFAULT_CACHE_DIR "/var/cache/osm/" #endif @@ -208,7 +216,7 @@ BEGIN_C_DECLS * SYNOPSIS */ #ifdef __WIN__ -#define OSM_DEFAULT_LOG_FILE "C:\\Windows\\Temp\\osm.log" +#define OSM_DEFAULT_LOG_FILE strcat(GetOsmPath(), "osm.log") #else #define OSM_DEFAULT_LOG_FILE "/var/log/osm.log" #endif From mst at mellanox.co.il Tue Feb 7 00:56:38 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 7 Feb 2006 10:56:38 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060207085637.GU31609@mellanox.co.il> Quoting r. Roland Dreier : > Related to this, the way priv->broadcast is initialized > in ipoib_mcast_join_task() looks somewhat unsafe, since there's no > lock and conceivable a send-only join could complete before > priv->broadcast is fully set up. What do you think? I agree. Do you want to fix it or should I? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Tue Feb 7 01:02:50 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 7 Feb 2006 11:02:50 +0200 (IST) Subject: [openib-general] iser: 3 change sets Message-ID: ------------------------------------------------------------------------ r5329 | ogerlitz | 2006-02-07 11:01:04 +0200 (Tue, 07 Feb 2006) | 5 lines refined conn term flow, removed two cases from iser_conn_sync_terminate, made iser_complete_conn_termination and iser_conn_async_terminate void Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5328 | ogerlitz | 2006-02-07 10:24:27 +0200 (Tue, 07 Feb 2006) | 4 lines more cleanups, made some code static, coding conventions fixes Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5327 | ogerlitz | 2006-02-07 09:00:27 +0200 (Tue, 07 Feb 2006) | 5 lines removed iser_conn.c, iser_conn_{bind,term} integrated into iscsi_iser_conn_{bind,stop}, the rest of the code split between iser_initiator.c and iser_verbs.c Signed-off-by: Or Gerlitz From mst at mellanox.co.il Tue Feb 7 01:46:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Feb 2006 11:46:48 +0200 Subject: [openib-general] Re: mthca: gid index bug? In-Reply-To: References: Message-ID: <20060207094648.GX31609@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: mthca: gid index bug? 
> > > Roland, in mthca_qp.c we have > > > > path->mgid_index = ah->grh.sgid_index; > > > > Shouldnt the port number be taken into account, like it > > is with mthca_av, where we have > > av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + > > I really don't know. The PRM just says "index to port GID table". > Can you check it out at Mellanox and (even better) generate a patch if > it's wrong? The existing code is correct. We'll try to clarify this in the next PRM. Thanks, -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Tue Feb 7 05:26:03 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 7 Feb 2006 15:26:03 +0200 (IST) Subject: [openib-general] [PATCH] iser: bugfix for connect error flow Message-ID: bugfix for connect error flow when getting RDMA_CM_EVENT_ADDR_ERROR Signed-off-by: Or Gerlitz Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 5329) +++ iser_verbs.c (revision 5330) @@ -626,7 +626,9 @@ void iser_conn_release(struct iser_conn if (atomic_read(&p_iser_conn->state) == ISER_CONN_DOWN) { iser_free_ib_conn_res(p_iser_conn); /* qp/id freed only once */ p_iser_conn->p_adaptor = NULL; - iser_adaptor_try_release(p_iser_adaptor); + /* on EVENT_ADDR_ERROR there's no adaptor yet for this conn */ + if (p_iser_adaptor != NULL) + iser_adaptor_try_release(p_iser_adaptor); p_iscsi_conn = p_iser_conn->p_iscsi_conn; if (p_iscsi_conn != NULL && p_iscsi_conn->ff_mode_enabled) { p_iscsi_conn->ff_mode_enabled = 0; From rdreier at cisco.com Tue Feb 7 05:55:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 05:55:03 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060207085637.GU31609@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Feb 2006 10:56:38 +0200") References: <20060207085637.GU31609@mellanox.co.il> Message-ID: Michael> I agree. Do you want to fix it or should I? 
If you get a chance that would be great. I'm at the OpenIB workshop now so I probably can't seriously look at it until tomorrow at the earliest. - R. From mst at mellanox.co.il Tue Feb 7 06:22:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Feb 2006 16:22:38 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060207142238.GC31609@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib_mcast_send.patch > > Michael> I agree. Do you want to fix it or should I? > > If you get a chance that would be great. I'm at the OpenIB workshop > now so I probably can't seriously look at it until tomorrow at the > earliest. Looks like I'm a bit busy too for a couple of days. Lets get other stuff merged and get back to it later. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Tue Feb 7 06:31:26 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 7 Feb 2006 06:31:26 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D93C@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: >> I am not clear what you are proposing? >> A transport specific API? >> >> The current proposal provides on sending side: >> single post, and single completion in the error free case. >> This is commonality that simplify ULP. > > App 1 - transport aware: > > if (transport == IB) > Do something > else > Do something different > > App 2 - transport independent: > > if (immediate data flag set) > if (DTO == 1) > Do something > else > do something else > else > do something different > Two points: First it is more like the_data = (DTO == 1) ? event->immediate : dto->buf; And further it is only on the receiving side. And only if the receiving side cares about the data (sometimes it only needs the notification). 
The more critical issue to me is whether this justifies all the text required to explain proper sizing of EPs and EVDs. I'd love to hear from some app developers as whether this is too complex to use, or a nice simplification. It isn't that hard for an iWARP Provider to do this, but any extra work is best avoided if the feature won't get used because the associated "caveats" text is too long. > All you've done is add flags in order to call the API > "transport neutral". The result to the application is the > same, except that the interface is more complex than it needs > to be, and causes confusion on the receiving side. And on > the sending side, the application still needs to check the > flag to see if immediate data is supported. > > A true transport neutral API wouldn't need flags that specify > the actual differences between the transports. > > The requirement is to provide an API that supports RDMA > writes with immediate data. A send that follows an RDMA > write is not immediate data, and the API should not be > constructed around trying to make it so. > > If you want to add a new requirement to the API to support > posting multiple work requests with a single call, that is a > different requirement. > > The attempt is to define a composite work request that can reduce the number of actual work requests required for some providers, without requiring different work flows dependent on whether the "immediate" feature was present. If this is not possible then DAT should simply not support the feature through anything other than a provider dependent extension function. In particular we should not be carving any exceptions to normal DAT ordering rules. In other words it is important that the delivery that confirms the RDMA Write to the Data Sink be ordered exactly as if an RDMA Send had been used to confirm delivery. 
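As a rough sketch of the receive-side selection discussed in the message above: only the expression `the_data = (DTO == 1) ? event->immediate : dto->buf;` comes from the thread. The type and function names below (`demo_event`, `demo_dto`, `pick_data`) are invented for illustration and are not part of any DAT header, and the reading of "DTO == 1" as "the value arrived natively as immediate data" is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical types, invented for this sketch; not DAT definitions. */
enum demo_dto_kind { DEMO_DTO_RECV = 0, DEMO_DTO_IMMEDIATE = 1 };

struct demo_event {
    enum demo_dto_kind kind;
    uint32_t immediate;      /* meaningful only for DEMO_DTO_IMMEDIATE */
};

struct demo_dto {
    const void *buf;         /* receive buffer filled by the trailing Send */
};

/* Return the 32-bit notification value regardless of how it arrived. */
static uint32_t pick_data(const struct demo_event *event,
                          const struct demo_dto *dto)
{
    uint32_t value;

    /* Native case: the value travels in the completion event itself. */
    if (event->kind == DEMO_DTO_IMMEDIATE)
        return event->immediate;

    /* Emulated case: the value was delivered into the receive buffer. */
    memcpy(&value, dto->buf, sizeof value);
    return value;
}
```

A consumer that only needs the notification, and not the data, would not have to call anything like `pick_data` at all, which matches the point that the check matters only when the receiving side cares about the value.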
From sean.hefty at intel.com Tue Feb 7 07:39:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 07:39:16 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122D93C@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: > And further it is only on the receiving side. > And only if the receiving side cares about the data > (sometimes it only needs the notification). The send side cares about this check because it must size its SQ appropriately. I disagree with the assumption that a "transport neutral" API is inherently easier for the application developer. >The attempt is to define a composite work request that can >reduce the number of actual work requests required for >some providers, without requiring different work flows >dependent on whether the "immediate" feature was present. This is exactly what Roy was pointing out. This is no longer defining a write with immediate data, but instead addressing some other requirement. In this case, you can define a generic send side API that takes multiple work requests as input, since a provider may be able to reduce the actual number of work requests in this case as well. - Sean From Arkady.Kanevsky at netapp.com Tue Feb 7 07:48:57 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 7 Feb 2006 10:48:57 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal Message-ID: It is much simpler to handle immediate data as a DAT extension. Spec changes are minimal. One extra field for DTO completion and for DAT_DTOS. One fix in redirection. The rest is up to a provider to define in dat_providername_extensions. How each provider defines analogous features is outside the scope of DAT. This includes versioning and feature discovery. Why would any Consumer hook itself on "proprietary" features and APIs is a different question. 
The other possibility is to define immediate data as a DAT API with pure IB semantics. Providers for other transports will report the feature as not supported. If in the future their transports define analogous features with different semantics, we will need to redefine APIs to be transport independent. But we may have the same problem with the current proposal also. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, February 07, 2006 9:31 AM > To: Sean Hefty; Kanevsky, Arkady; Larsen, Roy K; > dat-discussions at yahoogroups.com; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > 2.0immediatedataproposal > > Sean Hefty wrote: > >> I am not clear what you are proposing? > >> A transport specific API? > >> > >> The current proposal provides on sending side: > >> single post, and single completion in the error free case. > >> This is commonality that simplify ULP. > > > > App 1 - transport aware: > > > > if (transport == IB) > > Do something > > else > > Do something different > > > > App 2 - transport independent: > > > > if (immediate data flag set) > > if (DTO == 1) > > Do something > > else > > do something else > > else > > do something different > > > > Two points: > > First it is more like > > the_data = (DTO == 1) ? event->immediate : dto->buf; > > And further it is only on the receiving side. > And only if the receiving side cares about the data > (sometimes it only needs the notification). > > The more critical issue to me is whether this justifies all > the text required to explain proper sizing of EPs and EVDs. > > I'd love to hear from some app developers as whether this is > too complex to use, or a nice simplification. 
It isn't that > hard for an iWARP Provider to do this, but any extra work is > best avoided if the feature won't get used because the > associated "caveats" text is too long. > > > > All you've done is add flags in order to call the API "transport > > neutral". The result to the application is the same, > except that the > > interface is more complex than it needs to be, and causes > confusion on > > the receiving side. And on the sending side, the application still > > needs to check the flag to see if immediate data is supported. > > > > A true transport neutral API wouldn't need flags that specify the > > actual differences between the transports. > > > > The requirement is to provide an API that supports RDMA writes with > > immediate data. A send that follows an RDMA write is not immediate > > data, and the API should not be constructed around trying > to make it > > so. > > > > If you want to add a new requirement to the API to support posting > > multiple work requests with a single call, that is a different > > requirement. > > > > > > The attempt is to define a composite work request that can > reduce the number of actual work requests required for some > providers, without requiring different work flows dependent > on whether the "immediate" feature was present. > > If this is not possible then DAT should simply not support > the feature through anything other than a provider dependent > extension function. In particular we should not be carving > any exceptions to normal DAT ordering rules. In other words > it is important that the delivery that confirms the RDMA > Write to the Data Sink be ordered exactly as if an RDMA Send > had been used to confirm delivery. 
> > > From Arkady.Kanevsky at netapp.com Tue Feb 7 07:51:10 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 7 Feb 2006 10:51:10 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal Message-ID: But each of the multiple work requests follows the semantics of a single completion per work request. It can be controlled by completion_flags but it is still not the semantics of a "single" post. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Tuesday, February 07, 2006 10:39 AM > To: 'Caitlin Bestler'; Kanevsky, Arkady; Larsen, Roy K; > dat-discussions at yahoogroups.com; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > 2.0immediatedataproposal > > > And further it is only on the receiving side. > > And only if the receiving side cares about the data > > (sometimes it only needs the notification). > > The send side cares about this check because it must size its > SQ appropriately. > I disagree with the assumption that a "transport neutral" API > is inherently easier for the application developer. > > >The attempt is to define a composite work request that can > reduce the > >number of actual work requests required for some providers, without > >requiring different work flows dependent on whether the "immediate" > >feature was present. > > This is exactly what Roy was pointing out. This is no longer > defining a write with immediate data, but instead addressing > some other requirement. In this case, you can define a > generic send side API that takes multiple work requests as > input, since a provider may be able to reduce the actual > number of work requests in this case as well. 
> > - Sean > From swise at opengridcomputing.com Tue Feb 7 08:01:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 07 Feb 2006 10:01:33 -0600 Subject: [openib-general] ibstat problem Message-ID: <1139328093.26395.18.camel@stevo-desktop> Anyone see this before? ----- vic17:~ # ibstat ibstat: relocation error: ibstat: symbol argv0, version IBCOMMON_1.0 not defined in file libibcommon.so.1 with link time reference vic17:~ # uname -a Linux vic17 2.6.15.2-kdb #4 SMP PREEMPT Mon Feb 6 17:24:41 CST 2006 i686 i686 i386 GNU/Linux vic17:~ # ----- [swise at dell3 src]$ svn info Path: . URL: https://openib.org/svn/gen2/trunk/src Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 5330 Node Kind: directory Schedule: normal Last Changed Author: ogerlitz Last Changed Rev: 5330 Last Changed Date: 2006-02-07 07:23:38 -0600 (Tue, 07 Feb 2006) [swise at dell3 src]$ From sean.hefty at intel.com Tue Feb 7 08:08:04 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 08:08:04 -0800 Subject: [openib-general] RE: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060207073933.GA14165@mellanox.co.il> Message-ID: >Large RMPP support: changes/additions to underlying data structures and >prototypes. Thanks. I'm at the OpenIB conference currently, but should be able to review this by the end of the week. - Sean From sean.hefty at intel.com Tue Feb 7 08:11:36 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 08:11:36 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal In-Reply-To: Message-ID: >Why would any Consumer hook itself on "proprietary" features and >APIs is a different question. Because it provides a real performance benefit. This is the same reason apps code to DAPL versus standard sockets. 
- Sean From caitlinb at broadcom.com Tue Feb 7 08:12:29 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 7 Feb 2006 08:12:29 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT 2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D959@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: >> And further it is only on the receiving side. >> And only if the receiving side cares about the data >> (sometimes it only needs the notification). > > The send side cares about this check because it must size its SQ > appropriately. I disagree with the assumption that a "transport > neutral" API is inherently easier for the application developer. > >> The attempt is to define a composite work request that can reduce the >> number of actual work requests required for some providers, without >> requiring different work flows dependent on whether the "immediate" >> feature was present. > > This is exactly what Roy was pointing out. This is no longer > defining a write with immediate data, but instead addressing > some other requirement. In this case, you can define a > generic send side API that takes multiple work requests as > input, since a provider may be able to reduce the actual > number of work requests in this case as well. > > - Sean Yes, such an interface is more general. It would be something along the lines of dat_ep_post_exchange() which would post the SGLs for zero or more RDMA Writes and a single RDMA Send. It would be matched on the other end by a single receive. Would that be easy for IB vendors to optimize? It's pretty much the same for an iWARP provider. From halr at voltaire.com Tue Feb 7 08:12:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 7 Feb 2006 18:12:00 +0200 Subject: [openib-general] ibstat problem Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AC41@taurus.voltaire.com> Hi Steve, This looks similar to the ibping problem. Could you update libibcommon.map and rebuild libibcommon? Thanks. 
-- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Steve Wise Sent: Tue 2/7/2006 11:01 AM To: openib-general Subject: [openib-general] ibstat problem Anyone see this before? ----- vic17:~ # ibstat ibstat: relocation error: ibstat: symbol argv0, version IBCOMMON_1.0 not defined in file libibcommon.so.1 with link time reference vic17:~ # uname -a Linux vic17 2.6.15.2-kdb #4 SMP PREEMPT Mon Feb 6 17:24:41 CST 2006 i686 i686 i386 GNU/Linux vic17:~ # ----- [swise at dell3 src]$ svn info Path: . URL: https://openib.org/svn/gen2/trunk/src Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 5330 Node Kind: directory Schedule: normal Last Changed Author: ogerlitz Last Changed Rev: 5330 Last Changed Date: 2006-02-07 07:23:38 -0600 (Tue, 07 Feb 2006) [swise at dell3 src]$ From robert.j.woodruff at intel.com Tue Feb 7 09:07:31 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Feb 2006 09:07:31 -0800 Subject: [openib-general] Pathscale driver build broken in SVN5330 Message-ID: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> I get the following build error when compiling SVN5330. CC [M] drivers/infiniband/hw/ipath/ipath_verbs.o drivers/infiniband/hw/ipath/ipath_verbs.c: In function `ipath_alloc_fmr': drivers/infiniband/hw/ipath/ipath_verbs.c:5759: error: structure has no member named `page_size' make[3]: *** [drivers/infiniband/hw/ipath/ipath_verbs.o] Error 1 make[2]: *** [drivers/infiniband/hw/ipath] Error 2 make[1]: *** [drivers/infiniband] Error 2 It looks like the "page_size" struct member has been renamed to "page_shift" in SVN5330. 
woody From bos at pathscale.com Tue Feb 7 09:23:41 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 07 Feb 2006 09:23:41 -0800 Subject: [openib-general] Pathscale driver build broken in SVN5330 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> Message-ID: <1139333021.27739.80.camel@camp4.serpentine.com> On Tue, 2006-02-07 at 09:07 -0800, Woodruff, Robert J wrote: > I get the following build error when compiling > SVN5330. We'll commit a fix later today. Robert is in Sonoma at the OpenIB workshop, and he's our svn committer, so it might take a little while. Thanks for pointing this out. (Robert J. Woodruff's message of "Tue, 7 Feb 2006 09:07:31 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> Message-ID: Thanks, I broke this when I merged Or's FMR patch. I checked in this fix: --- infiniband/hw/ipath/ipath_verbs.c (revision 5330) +++ infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -5756,7 +5756,7 @@ static struct ib_fmr *ipath_alloc_fmr(st fmr->mr.offset = 0; fmr->mr.access_flags = mr_access_flags; fmr->mr.max_segs = fmr_attr->max_pages; - fmr->page_size = fmr_attr->page_size; + fmr->page_size = fmr_attr->page_shift; return &fmr->ibfmr; } From bos at pathscale.com Tue Feb 7 09:49:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 07 Feb 2006 09:49:07 -0800 Subject: [openib-general] Pathscale driver build broken in SVN5330 In-Reply-To: References: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> Message-ID: <1139334547.27739.81.camel@camp4.serpentine.com> On Tue, 2006-02-07 at 09:29 -0800, Roland Dreier wrote: > Thanks, I broke this when I merged Or's FMR patch. Thanks. All 3 options: proposed APIs, extensions, or IB semantic API all provide the same performance benefit on IB. But the last option is the easiest to use. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Tuesday, February 07, 2006 11:12 AM > To: Kanevsky, Arkady; Caitlin Bestler; Larsen, Roy K; > dat-discussions at yahoogroups.com; Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] DAT > 2.0immediatedataproposal > > >Why would any Consumer hook itself on "proprietary" features > and APIs > >is a different question. > > Because it provides a real performance benefit. This is the > same reason apps code to DAPL versus standard sockets. > > - Sean > > From ardavis at ichips.intel.com Tue Feb 7 09:59:40 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 07 Feb 2006 09:59:40 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: References: Message-ID: <43E8E00C.5030809@ichips.intel.com> Sean Hefty wrote: >>The requirement is to provide an API that supports RDMA writes with immediate >>data. A send that follows an RDMA write is not immediate data, and the API >>should not be constructed around trying to make it so. >> >> > >To be clear, I believe that write with immediate should be part of the normal >APIs, rather than an extension, but should be designed around those devices that >provide it natively. > > I totally agree. A standard RDMA write with immediate API can be very useful to RDMA applications based on the requirements (native support) set forth in my earlier email. It is analogous to the new dat_ep_post_send_with_invalidate() call; a call that supports a native iWARP transport operation but provides no provisions to help other transports emulate. So, other transports simply return NOT_SUPPORTED and add it natively in the future if it makes sense. 
-arlin

From caitlinb at broadcom.com Tue Feb 7 11:17:20 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 7 Feb 2006 11:17:20 -0800
Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D9D0@NT-SJCA-0751.brcm.ad.broadcom.com>

Arlin Davis wrote:
> Sean Hefty wrote:
>
>>> The requirement is to provide an API that supports RDMA writes with
>>> immediate data. A send that follows an RDMA write is not immediate
>>> data, and the API should not be constructed around trying to make
>>> it so.
>>>
>>>
>>
>> To be clear, I believe that write with immediate should be part of
>> the normal APIs, rather than an extension, but should be designed
>> around those devices that provide it natively.
>>
>>
> I totally agree. A standard RDMA write with immediate API can
> be very useful to RDMA applications based on the requirements
> (native support) set forth in my earlier email. It is analogous to
> the new dat_ep_post_send_with_invalidate() call; a call that supports
> a native iWARP transport operation but provides no provisions
> to help other transports emulate. So, other transports simply
> return NOT_SUPPORTED and add it natively in the future if it makes
> sense.
>
> -arlin

What is proposed is a definition of 'dat_ep_post_rdma_write_with_immediate' that can be implemented over iWARP using the sequence of messages that were intended to support the same purpose (i.e., letting the other side know that an RDMA Write transfer has been fully received). This definition also conforms to all existing DAT ordering rules.

Is there anything wrong with this definition for an IB provider?

There is a similarity between write_with_immediate and send_with_invalidate in that they combine operations which a) are already logically tied from the consumer's perspective and b) can be more easily optimized by the Provider over the wire when presented as one request.
Indeed, with send_with_invalidate it *has* to be optimized since you cannot send the invalidate later.

From rdreier at cisco.com Tue Feb 7 10:18:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 07 Feb 2006 10:18:39 -0800
Subject: [openib-general] [PATCH 2 of 3] mad: large RMPP support
In-Reply-To: <20060207074036.GB14165@mellanox.co.il> (Jack Morgenstein's message of "Tue, 7 Feb 2006 09:40:36 +0200")
References: <20060207074036.GB14165@mellanox.co.il>
Message-ID:

> +	rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad;

Trivial, but I prefer a space after cast operators.

> +static struct ib_umad_packet *alloc_packet(void)
> +{
> +	struct ib_umad_packet *packet;
> +	int length = sizeof *packet + sizeof(struct ib_mad);
> +
> +	packet = kzalloc(length, GFP_KERNEL);
> +	if (!packet) {
> +		printk(KERN_ERR "alloc_packet: mem alloc failed for length %d\n",
> +		       length);
> +		return NULL;
> +	}
> +	INIT_LIST_HEAD(&packet->seg_list);
> +	return packet;
> +}

This seems a little too big for what it's actually doing. Also, do we really need to print a kernel error when the system is out of memory? I would just write this as:

static struct ib_umad_packet *alloc_packet(void)
{
	struct ib_umad_packet *packet;

	packet = kzalloc(sizeof *packet + sizeof(struct ib_mad), GFP_KERNEL);
	if (!packet)
		return NULL;
	INIT_LIST_HEAD(&packet->seg_list);
	return packet;
}

Also the chunk that deletes the old definition of alloc_packet() seems to be in the next patch, so the tree won't compile without both patches (which will be a pain for someone doing git bisect).

> +	/* User buffer too small. Return first RMPP segment (which
> +	 * includes RMPP message length).
> +	 */

Trivial again but please use the comment format

	/*
	 * foo
	 */

in other words put the opening "/*" on a line by itself.

> +	/* multipacket RMPP MAD message. Copy remainder of message.
> +	 * Note that last segment may have a shorter payload.
> +	 */

same here

 - R.
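
For readers following the review, here is a minimal userspace sketch of the shorter alloc_packet() shape discussed above. It is not the kernel patch: calloc() stands in for kzalloc(), the list helpers are simplified copies of the kernel idiom, and both struct layouts (including the 256-byte stand-in for struct ib_mad) are assumptions made only so the example compiles on its own.

```c
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical layouts; the real definitions live in the kernel tree. */
struct ib_mad {
	unsigned char data[256];
};

struct ib_umad_packet {
	struct list_head seg_list;	/* RMPP segments beyond the first */
	int length;
	unsigned char mad[];		/* first MAD follows the header */
};

/*
 * Allocate a zeroed packet with room for one MAD.
 * Returns NULL on allocation failure, without logging.
 */
static struct ib_umad_packet *alloc_packet(void)
{
	struct ib_umad_packet *packet;

	packet = calloc(1, sizeof *packet + sizeof(struct ib_mad));
	if (!packet)
		return NULL;
	INIT_LIST_HEAD(&packet->seg_list);
	return packet;
}
```

The size is computed at the call site, so no separate length variable is needed once the error message is gone.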
From roy.k.larsen at intel.com Tue Feb 7 11:45:54 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Tue, 7 Feb 2006 11:45:54 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E08650F8F@orsmsx408> Caitlin Bestler wrote: > >Arlin Davis wrote: >> Sean Hefty wrote: >> >>>> The requirement is to provide an API that supports RDMA writes with >>>> immediate data. A send that follows an RDMA write is not immediate >>>> data, and the API should not be constructed around trying to make >>>> it so. >>>> >>>> >>> >>> To be clear, I believe that write with immediate should be part of >>> the normal APIs, rather than an extension, but should be designed >>> around those devices that provide it natively. >>> >>> >> I totally agree. A standard RDMA write with immediate API can >> be very useful to RDMA applications based on the requirements >> (native support) set forth in my earlier email. It is analogous to >> the new dat_ep_post_send_with_invalidate() call; a call that supports >> a native iWARP transport operation but provides no provisions >> to help other transports emulate. So, other transports simply >> return NOT_SUPPORTED and add it natively in the future if it makes >> sense. >> >> -arlin > >What is proposed in a definition of >'dat_ep_post_rdma_write_with_immediate' >that can be implemented over iWARP using the sequence of messages that >were intended to support the same purpose (i.e., letting the other >side know that an RDMA Write transfer has been fully received). No, iWARP *CAN NOT* implement write immediate data any better than IB can implement send with invalidate. Immediate data *MUST* be indicated to the ULP unambiguously. Imposing an algorithm on the application to infer immediate data arrival is hack, pure and simple. An application is free to perform a write/send if that is the semantic they want. Why does iWARP get transport unique APIs but not IB? 
I find this attempt to bastardize the IB semantic of immediate data a little curious. Roy From Arkady.Kanevsky at netapp.com Tue Feb 7 11:51:43 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 7 Feb 2006 14:51:43 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: IB does optionally support send_with_invalidate as defined in IBTA 1.2 spec. OpenIB does not support this yet but this is a different matter. So this is bad analogy. The better analogy is socket based CM. But I am still not clear what you are advocating: extensions, IB specific API or something else. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] > Sent: Tuesday, February 07, 2006 2:46 PM > To: dat-discussions at yahoogroups.com; Arlin Davis; Hefty, Sean > Cc: Kanevsky, Arkady; Sean Hefty; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > Caitlin Bestler wrote: > > > >Arlin Davis wrote: > >> Sean Hefty wrote: > >> > >>>> The requirement is to provide an API that supports RDMA > writes with > >>>> immediate data. A send that follows an RDMA write is > not immediate > >>>> data, and the API should not be constructed around > trying to make > >>>> it so. > >>>> > >>>> > >>> > >>> To be clear, I believe that write with immediate should > be part of > >>> the normal APIs, rather than an extension, but should be designed > >>> around those devices that provide it natively. > >>> > >>> > >> I totally agree. A standard RDMA write with immediate API > can be very > >> useful to RDMA applications based on the requirements (native > >> support) set forth in my earlier email. 
It is analogous to the new > >> dat_ep_post_send_with_invalidate() call; a call that supports a > >> native iWARP transport operation but provides no > provisions to help > >> other transports emulate. So, other transports simply return > >> NOT_SUPPORTED and add it natively in the future if it makes sense. > >> > >> -arlin > > > >What is proposed in a definition of > >'dat_ep_post_rdma_write_with_immediate' > >that can be implemented over iWARP using the sequence of > messages that > >were intended to support the same purpose (i.e., letting the > other side > >know that an RDMA Write transfer has been fully received). > > No, iWARP *CAN NOT* implement write immediate data any better > than IB can implement send with invalidate. Immediate data > *MUST* be indicated to the ULP unambiguously. Imposing an > algorithm on the application to infer immediate data arrival > is hack, pure and simple. An application is free to perform a > write/send if that is the semantic they want. Why does iWARP > get transport unique APIs but not IB? I find this attempt to > bastardize the IB semantic of immediate data a little curious. > > Roy > From caitlinb at broadcom.com Tue Feb 7 12:02:53 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 7 Feb 2006 12:02:53 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122D9E4@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Caitlin Bestler wrote: >> >> Arlin Davis wrote: >>> Sean Hefty wrote: >>> >>>>> The requirement is to provide an API that supports RDMA writes >>>>> with immediate data. A send that follows an RDMA write is not >>>>> immediate data, and the API should not be constructed around >>>>> trying to make it so. 
>>>>> >>>>> >>>> >>>> To be clear, I believe that write with immediate should be part of >>>> the normal APIs, rather than an extension, but should be designed >>>> around those devices that provide it natively. >>>> >>>> >>> I totally agree. A standard RDMA write with immediate API can be >>> very useful to RDMA applications based on the requirements (native >>> support) set forth in my earlier email. It is analogous to the new >>> dat_ep_post_send_with_invalidate() call; a call that supports a >>> native iWARP transport operation but provides no provisions to help >>> other transports emulate. So, other transports simply return >>> NOT_SUPPORTED and add it natively in the future if it makes sense. >>> >>> -arlin >> >> What is proposed in a definition of >> 'dat_ep_post_rdma_write_with_immediate' >> that can be implemented over iWARP using the sequence of messages >> that were intended to support the same purpose (i.e., letting the >> other side know that an RDMA Write transfer has been fully received). > > No, iWARP *CAN NOT* implement write immediate data any better > than IB can implement send with invalidate. Immediate data > *MUST* be indicated to the ULP unambiguously. Imposing an > algorithm on the application to infer immediate data arrival > is hack, pure and simple. An application is free to perform a > write/send if that is the semantic they want. Why does iWARP > get transport unique APIs but not IB? I find this attempt to > bastardize the IB semantic of immediate data a little curious. > The transports aren't getting anything. Features are there for applications, especially when the feature can be defined in a way that makes sense without explaining transport mechanics. Completing a transaction, complete with supplying a transaction response and releasing the advertised STag associated with the transaction is something that makes sense in the application domain and conforms to normal DAT ordering rules. 
"Provide information about an RDMA Write to a receive operation" also meets that definition -- as long as it conforms to the existing ordering rules. Shifting to an 8 byte message over iWARP to allow for the write length *and* immediate 'tag' is certainly doable. We could even consider having the DAT Provider supply the 'buffer' silently in the DTO itself. With that definition the consumer would get a receive completion that told them that their peer's RDMA Write had been successfully placed, how long it is (the length) and which one (a tag). I think that is of value. iWARP can implement it as two work requests and maintain the overall semantics. Are you arguing that iWARP should NOT provide this service until it can do it in a single work request? It seems to me that allowing an extra work request and completion is a fairly simple accomodation as opposed to using an alternate algorithm in the main transaction processing of the application. If we enable the applicatin can query how a remote write with immediate will complete outside of the transaction loop then we can allow the application to have *no* overhead inside the main transaction loop, and *identical* logic on the sending side. And IB *could* implement send with invalidate by simply agreeing on how the RKey to be invalidated is communicated between the IB providers (perhaps as an immediate). But more to the point, I don't see how the more flexible definition of write with immediate negatively impacts the IB implementation of the feature. IB providers do not need to allow for the extra work requests. They are not being asked to place the immediate data into the receive buffer, or to do any extra work at all. 
From roy.k.larsen at intel.com Tue Feb 7 12:25:15 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Tue, 7 Feb 2006 12:25:15 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E0865109E@orsmsx408> >IB does optionally support send_with_invalidate as defined in IBTA 1.2 >spec. >OpenIB does not support this yet but this is a different matter. >So this is bad analogy. > >The better analogy is socket based CM. > >But I am still not clear what you are advocating: >extensions, IB specific API or something else. I advocate a write with immediate data API that delivers immediate data to the target ULP *unambiguously*. That is, the ULP need never infer from buffer contents or receive completion timing that a write with immediate has taken place. If it is not granted formal API status, I advocate implementation as a DAPL extension. The notion of a combined request API is orthogonal so I won't pursue it any further in this thread. Roy From roy.k.larsen at intel.com Tue Feb 7 14:04:10 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Tue, 7 Feb 2006 14:04:10 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E0865132D@orsmsx408> >>> What is proposed in a definition of >>> 'dat_ep_post_rdma_write_with_immediate' >>> that can be implemented over iWARP using the sequence of messages >>> that were intended to support the same purpose (i.e., letting the >>> other side know that an RDMA Write transfer has been fully received). >> >> No, iWARP *CAN NOT* implement write immediate data any better >> than IB can implement send with invalidate. Immediate data >> *MUST* be indicated to the ULP unambiguously. Imposing an >> algorithm on the application to infer immediate data arrival >> is hack, pure and simple. An application is free to perform a >> write/send if that is the semantic they want. 
Why does iWARP >> get transport unique APIs but not IB? I find this attempt to >> bastardize the IB semantic of immediate data a little curious. >> > >The transports aren't getting anything. Features are there for >applications, especially when the feature can be defined in a >way that makes sense without explaining transport mechanics. > APIs exist to gain access to transport services so of course it is all about the transport. Presumably the transport services were defined because they seemed useful, but a transport service exists in a standard somewhere before it is defined in DAPL. I believe that the IB immediate data service and semantic is useful and should be supported too. >Completing a transaction, complete with supplying a transaction >response and releasing the advertised STag associated with the >transaction is something that makes sense in the application >domain and conforms to normal DAT ordering rules. > I don't disagree. And unambiguous immediate data indications fall into that same category which is why I'm puzzled there is so much resistance. >"Provide information about an RDMA Write to a receive operation" >also meets that definition -- as long as it conforms to the >existing ordering rules. Shifting to an 8 byte message over >iWARP to allow for the write length *and* immediate 'tag' >is certainly doable. We could even consider having the >DAT Provider supply the 'buffer' silently in the DTO itself. > If you make the receive indication unambiguous as to the fact it's associated with a write immediate, you've got my full support, even if immediate data is delivered differently by different transports. If not, it is nothing more than a write/send that the application can do itself. >With that definition the consumer would get a receive completion >that told them that their peer's RDMA Write had been successfully >placed, how long it is (the length) and which one (a tag). > >I think that is of value. 
iWARP can implement it as two work >requests and maintain the overall semantics. If completion of the service is ambiguous, I strongly disagree. The application can do this with write/send now and with more flexibility. True immediate indications are unambiguous and doesn't rely on the contents of a receive buffer or its completion timing. An application must be able to perform "normal" send/receives of any size and content simultaneously with RDMA write with immediate and without regard to when they arrive. The semantic proposed would put a constraint on how an application could use the send/receive facility. If an application can live with such a constraint, it is free to use write/send now. Those that can't or would perform much better with a legitimate write/immediate should be given access to the facility. > >Are you arguing that iWARP should NOT provide this service >until it can do it in a single work request? I'm arguing that an iWARP provider NOT support this service until it can deliver immediate data indications unambiguously. >It seems to >me that allowing an extra work request and completion is >a fairly simple accomodation as opposed to using an alternate >algorithm in the main transaction processing of the application. > >If we enable the applicatin can query how a remote write >with immediate will complete outside of the transaction loop >then we can allow the application to have *no* overhead inside >the main transaction loop, and *identical* logic on the sending >side. I would contend that placing constraints on what and when an application can send "normal" data just to use write immediate is far far worse. And all just too basically save one extra function call. > >And IB *could* implement send with invalidate by simply agreeing >on how the RKey to be invalidated is communicated between the >IB providers (perhaps as an immediate). I'm afraid I don't follow. 
If you're talking about providers setting up their own private EPs to communicate, perhaps that's a solution for iWARP providers to supply unambiguous immediate data indications....

>
>But more to the point, I don't see how the more flexible
>definition of write with immediate negatively impacts the
>IB implementation of the feature. IB providers do not need
>to allow for the extra work requests. They are not being
>asked to place the immediate data into the receive buffer,
>or to do any extra work at all.

This is not about extra work requests or the initiating API. It's about a very poor/non-existent indication semantic and the neutering of a legitimate one. Surely you would not allow the semantics of write with invalidate to be relaxed or changed to support an emulation, right?

Look, if an API provides a semantic that allows ambiguous services and leaves it as an exercise to the application to figure out the service has been rendered, it is a hack. I'm surprised that even has to be argued. Transports that support true write with immediate do not make the immediate indication ambiguous. The application is free to use the receive queue for any other combination of receive operations. A legitimate write immediate service does not put usage constraints on the receive queue. I could not be convinced otherwise, so if those proposing such a constrained semantic feel the same, I'll consider this thread dead.

From mdidomenico at gmail.com Tue Feb 7 14:10:34 2006
From: mdidomenico at gmail.com (Michael Di Domenico)
Date: Tue, 7 Feb 2006 17:10:34 -0500
Subject: [openib-general] openib and mellanox hca problem
Message-ID: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com>

I'm trying to build a system using the openib drivers with a mellanox hca card. I don't have much information about the card itself, it's in a server right now...

But I downloaded openib today from the svn source, installed it onto a fresh copy of Fedora Core 4 with Kernel version 2.6.15.3...
Everything seemed to compile fine and install okay. I've been following the instructions from the wiki page thus far without a problem. I get up to this step

modprobe ib_mthca

and get the below error in /var/log/messages. Strangely enough all the modules load, and I do a udevstart, but I never get a /dev/infiniband directory and the /sys/class/infiniband directory is empty.

Does anyone know how I might fix this, or point me to some better documentation than what is on the wiki?

Thanks
- Michael

Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Initializing 0000:07:00.0
Feb 7 16:59:37 linux14-ts kernel: ACPI: PCI Interrupt 0000:07:00.0[?] -> GSI 26 (level, low) -> IRQ 217
Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device did not come back after reset, aborting.
Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: Failed to reset HCA, aborting.
Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device 0000:07:00.0 disabled

--- lspci output
06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) (prog-if ff)
!!! Unknown header type 7f

07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) (prog-if ff)
!!! Unknown header type 7f

From caitlinb at broadcom.com Tue Feb 7 14:27:15 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 7 Feb 2006 14:27:15 -0800
Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DA20@NT-SJCA-0751.brcm.ad.broadcom.com>

Larsen, Roy K wrote:
>
>> Completing a transaction, complete with supplying a transaction
>> response and releasing the advertised STag associated with the
>> transaction is something that makes sense in the application domain
>> and conforms to normal DAT ordering rules.
>>
>
> I don't disagree.
And unambiguous immediate data indications
> fall into that same category which is why I'm puzzled there is so
> much resistance.
>
>> "Provide information about an RDMA Write to a receive operation"
>> also meets that definition -- as long as it conforms to the existing
>> ordering rules. Shifting to an 8 byte message over iWARP to allow for
>> the write length *and* immediate 'tag'
>> is certainly doable. We could even consider having the DAT Provider
>> supply the 'buffer' silently in the DTO itself.
>>
>
> If you make the receive indication unambiguous as to the fact
> it's associated with a write immediate, you've got my full
> support, even if immediate data is delivered differently by
> different transports. If not, it is nothing more than a
> write/send that the application can do itself.
>

From the viewpoint of the Provider/RNIC/driver there is no wire Send message which can be known to be associated with an RDMA Write. Up to the maximum message size, any combination of bytes is legal.

Therefore this is a distinction that an iWARP provider CANNOT make. Unlike send with invalidate, which CAN be supported under IB 1.2.

Keep in mind that under the basic DAT semantics, RDMA Writes are *not* signalled to the Data Sink. That was settled years ago.

So under DAT semantics you use a Send to cause a completion at the other end -- period.

What we are talking about is whether to allow short sends that supply a minimal set of standard information to be piggy-backed on a prior RDMA Write by making it an RDMA Write with immediate.

I am not opposed to allowing IB Providers to do that. I am opposed to changing the fundamental DAT semantics that RDMA Writes are not visible to the Data Sink. Conceptually, a Send is required.

And as with all Send Messages, it is up to the *application* to ensure that their meaning is known at the Data Sink. This can be done by ordering and/or content of the data.
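
The emulation debated in this exchange, an RDMA Write followed by a short Send that carries the write length and a tag, can be sketched roughly as below. The 8-byte layout (32-bit length, then 32-bit tag, both big-endian) and every name in the block are assumptions for illustration; none of it is a DAT or iWARP definition. Note that the decoder can only check the size. Any 8-byte Send would decode successfully, which is the ambiguity being argued over, so the application still has to know from ordering or content that a given receive is such a notice.

```c
#include <stddef.h>
#include <stdint.h>

#define WRITE_NOTICE_LEN 8	/* hypothetical: 4-byte length + 4-byte tag */

struct write_notice {
	uint32_t write_len;	/* bytes placed by the preceding RDMA Write */
	uint32_t tag;		/* consumer-chosen identifier ("immediate") */
};

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Sender side: build the payload of the Send posted after the RDMA Write. */
static void encode_write_notice(const struct write_notice *n,
				uint8_t buf[WRITE_NOTICE_LEN])
{
	put_be32(buf, n->write_len);
	put_be32(buf + 4, n->tag);
}

/*
 * Data Sink side: interpret a completed receive as a write notice.
 * Returns 0 on success, -1 if the received length cannot be a notice.
 * A matching length does not prove the sender meant it as one.
 */
static int decode_write_notice(const uint8_t *buf, size_t len,
			       struct write_notice *out)
{
	if (len != WRITE_NOTICE_LEN)
		return -1;
	out->write_len = get_be32(buf);
	out->tag = get_be32(buf + 4);
	return 0;
}
```

A transport with native write-with-immediate would not need the decoder at all, since the completion itself would say which kind of operation arrived.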
From rpandit at silverstorm.com Tue Feb 7 14:39:00 2006 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Tue, 7 Feb 2006 14:39:00 -0800 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> Message-ID: <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> Michael, I have seen this problem before.. See following mail thread http://www.mail-archive.com/openib-general at openib.org/msg13861.html Commenting out call to mthca_reset() in mthca_main.c worked around the problem on my system, and as far as I can tell, did not have any negative impact. It will be good if someone reviews the reset path in mthca. Ranjit On 2/7/06, Michael Di Domenico wrote: > I'm trying to build a system using the openib drivers with a mellanox > hca card. I don't have much information about the card itself, it's > in a server right now... > > But I downloaded openib today from the svn source, installed it onto a > fresh copy of Fedora Core 4 with Kernel version 2.6.15.3... > Everything seemed to compile fine and install okay. I've been > following the instructions from the wiki page thus far without a > problem. I get upto this step > > modprobe ib_mthca > > and get the below error in /var/log/messages. Strangely enough all > the modules load, and i do a udevstart, but i never get a > /dev/infiniband directory and /sys/class/infiniband directory is > empty. > > Does anyone know how i might fix this, or point me to some better > documentation then what is on the wiki? > > Thanks > - Michael > > > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA > driver v0.06 (June 23, 2005) > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Initializing 0000:07:00.0 > Feb 7 16:59:37 linux14-ts kernel: ACPI: PCI Interrupt 0000:07:00.0[?] 
> -> GSI 26 (level, low) -> IRQ 217 > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device > did not come back after reset, aborting. > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: Failed to > reset HCA, aborting. > Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device > 0000:07:00.0 disabled > > > --- lspci output > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) > (prog-if ff) > !!! Unknown header type 7f > > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) > (prog-if ff) > !!! Unknown header type 7f > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From roy.k.larsen at intel.com Tue Feb 7 15:40:44 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Tue, 7 Feb 2006 15:40:44 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E08651654@orsmsx408> >>> Completing a transaction, complete with supplying a transaction >>> response and releasing the advertised STag associated with the >>> transaction is something that makes sense in the application domain >>> and conforms to normal DAT ordering rules. >>> >> >> I don't disagree. And unambiguous immediate data indications >> fall into that same category which is why I'm puzzled there is so >> much resistance. >> >>> "Provide information about an RDMA Write to a receive operation" >>> also meets that definition -- as long as it conforms to the existing >>> ordering rules. Shifting to an 8 byte message over iWARP to allow for >>> the write length *and* immediate 'tag' >>> is certainly doable. We could even consider having the DAT Provider >>> supply the 'buffer' silently in the DTO itself. 
>>> >> >> If you make the receive indication unambiguous as to the fact >> it's associated with a write immediate, you've got my full >> support, even if immediate data is delivered differently by >> different transports. If not, it is nothing more than a >> write/send that the application can do itself. >> > >From the viewpoint of the Provider/RNIC/driver there is no >wire Send message which can be known to be associated with >an RDMA Write. Up to the maximum message size, any combination >of bytes are legal. > >Therefore this is a distinct that an iWARP provider CANNOT make. >Unlike send with invalidate, which CAN be supported under IB 1.2. So, your stance is that if an RDMA transport protocol specification exists that can't support or emulate a service faithfully, an API can't exist for that service in DAPL. Ok, that makes this discussion much clearer and to the point. >Keep in mind that under the basic DAT semantics, RDMA Writes >are *not* signalled to the Data Sink. That was settled years ago. > >So under DAT semantics you use a Send to cause a completion at >the other end -- period. Addressed below... > >What we are talking about is whether to allow short sends that >supply a minimal set of standard information to be piggy-backed >on a prior RDMA Write by making it an RDMA Write with immediate. That is not what those advocating immediate data have been talking about. It has always been whether the IB capability could be exposed to the ULP. Remember, the first proposal by Arlin was to make this an extension. This list wanted to expose it as a formal API. So, I find that assertion puzzling. >I am not opposed to allowing IB Providers to do that. I am opposed >to changing the fundamental DAT semantics that RDMA Writes are not >visible the Data Sink. Conceptually, a Send is required. > Conceptual, eh? Well, of course IB immediate data *is* indicated on the receive queue. Not conceptual enough? But that aside, it is a rather strict and convenient interpretation. 
Are you sure you want to put a stake that deep in the ground about all currently defined DAPL semantics against transport standards that evolve, or just those that can't be implemented by all transports? I was under the assumption that the DAT community defined the APIs and semantics through an open process. Given that the IB write immediate data facility does not break the implementation or semantics of the currently defined RDMA write facility, I see no reason the DAPL spec couldn't be updated, through consensus, with the realities of existing transport services. Nevertheless, I presume you'll have no objection to implementing this useful service as a DAPL extension since the semantic rules for extensions haven't been defined yet. Roy From caitlinb at broadcom.com Tue Feb 7 15:56:50 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 7 Feb 2006 15:56:50 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DA46@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > > I was under the assumption that the DAT community defined the > APIs and semantics through an open process. Given that the > IB write immediate data facility does not break the > implementation or semantics of the currently defined RDMA > write facility, I see no reason the DAPL spec couldn't be > updated, through consensus, with the realities of existing > transport services. Nevertheless, I presume you'll have no > objection to implementing this useful service as a DAPL > extension since the semantic rules for extensions haven't > been defined yet. > > Roy That is correct, because as an extension the user would not expect normal semantics to still be guaranteed.
From rdreier at cisco.com Tue Feb 7 15:57:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 15:57:23 -0800 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> (Michael Di Domenico's message of "Tue, 7 Feb 2006 17:10:34 -0500") References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> Message-ID: > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device did not come back after reset, aborting. Can you give more details on the system where you saw this? - R. From mdidomenico at gmail.com Tue Feb 7 15:59:08 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Tue, 7 Feb 2006 18:59:08 -0500 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> Message-ID: <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> What specifically would you like to know? On 2/7/06, Roland Dreier wrote: > > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device did not come back after reset, aborting. > > Can you give more details on the system where you saw this? > > - R. > From Arkady.Kanevsky at netapp.com Tue Feb 7 16:09:25 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 7 Feb 2006 19:09:25 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: We have problem no matter which option we choose. The current Transport Level Requirement state: There is a one-to-one correspondence between send operation on one Endpoint of the Connection and recv operations on the other Endpoint of the Connection. There is no correspondence between RDMA operations on one Endpoint of the Connection and recv or send data transfer operation on the other Endpoint of the Connection. Receive operations on a Connection must be completed in the order of posting of their corresponding sends. 
The Immediate data and Atomic ops violate these requirements, including the ordering rules. I had started updating these rules when I generated the first draft of the requirements. They are included in the enclosed pdf file. But they do not cover Atomic ops, which also impact the transport requirements. This chapter of the spec has not been changed since DAPL 1.0 and I am very concerned about any changes to it. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, February 07, 2006 6:57 PM > To: Larsen, Roy K; dat-discussions at yahoogroups.com; Arlin > Davis; Hefty, Sean > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > openib-general-bounces at openib.org wrote: > > > > > I was under the assumption that the DAT community defined > the APIs and > > semantics through an open process. Given that the IB write > immediate > > data facility does not break the implementation or semantics of the > > currently defined RDMA write facility, I see no reason the > DAPL spec > > couldn't be updated, through consensus, with the realities > of existing > > transport services. Nevertheless, I presume you'll have no > objection > > to implementing this useful service as a DAPL extension since the > > semantic rules for extensions haven't been defined yet. > > > > Roy > > That is correct, because as an extension the user would not > expect normal semantics to still be guaranteed. > > > > > > > Yahoo! Groups Links > > <*> To visit your group on the web, go to: > http://groups.yahoo.com/group/dat-discussions/ > > <*> To unsubscribe from this group, send an email to: > dat-discussions-unsubscribe at yahoogroups.com > > <*> Your use of Yahoo!
Groups is subject to: > http://docs.yahoo.com/info/terms/ > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: transport_req_020706.pdf Type: application/octet-stream Size: 26434 bytes Desc: transport_req_020706.pdf URL: From caitlinb at broadcom.com Tue Feb 7 16:40:00 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 7 Feb 2006 16:40:00 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DA5D@NT-SJCA-0751.brcm.ad.broadcom.com> dat-discussions at yahoogroups.com wrote: > We have problem no matter which option we choose. > The current Transport Level Requirement state: > > There is a one-to-one correspondence between send operation > on one Endpoint of the Connection and recv operations on the > other Endpoint of the Connection. > There is no correspondence between RDMA operations on one > Endpoint of the Connection and recv or send data transfer > operation on the other Endpoint of the Connection. > Receive operations on a Connection must be completed in the > order of posting of their corresponding sends. > > The Immediate data and Atomic ops violate these requirements > including ordering rules. > > I had started updating these rules when I generated the first > draft of the requirements. They are included in the enclosed pdf file. > But they do not cover Atomic ops that also impact transport > requirements. This chapter of the spec have not been changed since > DAPL 1.0 > and I am very concern with any changes to it. > > Arkady > If "RDMA Write with Immediate" is viewed as being the equivalent of doing RDMA Write and then an RDMA Send the correspondence rule is maintained. But *only* if the "rdma write with immediate" has all of the semantics of a Send. Atomics do not violate the rules if you view them as being a variation on an RDMA Read. They are an RDMA Read with modify. 
The real question is whether it makes sense to put it in the RDMA device. It is also not subject to emulation at a higher layer. With send with invalidate we know how InfiniBand *will* support it, because of the IB 1.2 verbs. We do not know that for atomics over iWARP. We do not know whether it will be added; more importantly, we do not know *how* it would be added if it were added. That makes coming up with a transport neutral definition very premature. In particular, if atomics were added to iWARP there is a distinct design option where it would *not* be the same work queue as RDMA Reads (adding atomics through Queue ID 3 would make layering on top of a current implementation much easier). But it would mean that atomic credits would be distinct from read credits. This is a very strong reason to defer attempting to define RDMA Atomics in a transport neutral fashion. From rdreier at cisco.com Tue Feb 7 16:40:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 16:40:15 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060207142238.GC31609@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Feb 2006 16:22:38 +0200") References: <20060207142238.GC31609@mellanox.co.il> Message-ID: Anyway, I've (finally) applied this patch. - R. From sean.hefty at intel.com Tue Feb 7 17:01:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 17:01:25 -0800 Subject: [openib-general] RE: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060207073933.GA14165@mellanox.co.il> Message-ID: Based on what you've done, I'd like to suggest changing the interface to something similar to that shown below. I believe that this could be done with minor changes to the current patches. Detailed comments that led to suggesting this change are inline in my responses. struct ib_mad_segments { u32 num_segments; u32 segment_size; void *segment[0]; }; struct ib_mad_send_buf { ...
void *mad; /* First MAD segment */ struct ib_mad_segments *segments; /* RMPP segments > 1 */ ... }; This will avoid walking through a list to find segments, and allows for efficient allocation of the segment data buffers. Multiple segments could be allocated through a single kzalloc. (For example, every n-th segment would start a new allocation, making deallocation easy as well.) >+struct ib_mad_multipacket_seg { >+ struct list_head list; >+ u32 size; >+ u8 data[0]; >+}; Should we ensure that the data alignment is on a 64-byte boundary? > struct ib_mad_send_buf { > struct ib_mad_send_buf *next; >- void *mad; >+ void *mad; /* RMPP: first segment, >+ including the MAD header */ >+ void *mad_payload; /* RMPP: changed per segment */ Mad_payload doesn't appear to be directly accessible directly by the user. It should be hidden. - Sean From sean.hefty at intel.com Tue Feb 7 17:01:31 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 17:01:31 -0800 Subject: [openib-general] RE: [PATCH 2 of 3] mad: large RMPP support In-Reply-To: <20060207074036.GB14165@mellanox.co.il> Message-ID: >+static int data_offset(u8 mgmt_class) >+{ >+ if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) >+ return IB_MGMT_SA_HDR; >+ else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && >+ (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) >+ return IB_MGMT_VENDOR_HDR; >+ else >+ return IB_MGMT_RMPP_HDR; >+} I think that the RMPP code may have this same routine. If so, maybe we can make this an inline function. 
>+static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, >+ struct ib_umad_packet *packet) >+{ >+ struct ib_mad_recv_buf *seg_buf; >+ struct ib_rmpp_mad *rmpp_mad; >+ void *data; >+ struct ib_mad_multipacket_seg *seg; >+ int size, len, offset; >+ u8 flags; >+ >+ len = mad_recv_wc->mad_len; >+ if (len <= sizeof(struct ib_mad)) { >+ memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); >+ return 0; >+ } >+ >+ /* Multipacket (RMPP) MAD */ >+ offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); >+ >+ list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { >+ rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; >+ flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); >+ >+ if (flags & IB_MGMT_RMPP_FLAG_FIRST) { >+ size = sizeof(*rmpp_mad); >+ memcpy(&packet->mad.data, rmpp_mad, size); >+ } else { >+ data = (void *) rmpp_mad + offset; >+ if (flags & IB_MGMT_RMPP_FLAG_LAST) >+ size = len; >+ else >+ size = sizeof(*rmpp_mad) - offset; >+ seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + >+ sizeof(struct ib_rmpp_mad) - offset, >+ GFP_KERNEL); >+ if (!seg) >+ return -ENOMEM; >+ memcpy(seg->data, data, size); >+ list_add_tail(&seg->list, &packet->seg_list); >+ } >+ len -= size; >+ } >+ return 0; >+} It would be more efficient to just queue the received MAD until it can be copied directly to the userspace buffer, rather than copying it into a temporary buffer. >+ >+static struct ib_umad_packet *alloc_packet(void) >+{ >+ struct ib_umad_packet *packet; >+ int length = sizeof *packet + sizeof(struct ib_mad); >+ >+ packet = kzalloc(length, GFP_KERNEL); >+ if (!packet) { >+ printk(KERN_ERR "alloc_packet: mem alloc failed for length %d\n", >+ length); >+ return NULL; >+ } >+ INIT_LIST_HEAD(&packet->seg_list); >+ return packet; >+} We should probably just drop this function. It looks like it's only called in one place, plus would only save a single line of code for each place that it is called. 
- Sean From hyu at pantasys.com Tue Feb 7 17:08:19 2006 From: hyu at pantasys.com (Harris Yu) Date: Tue, 07 Feb 2006 17:08:19 -0800 Subject: [openib-general] Ifdown/ifup pick up the wrong ib interface configuration file Message-ID: <1139360899.14390.54.camel@simba.pantasys.com> Hi Everyone, Now I am using OpenIB Gen2 on SuSE10. I got a strange problem when I tried to bring up/down ib interface, I put the ib interface startup script ifcfg-ib0/ifcfg-ib1 under /etc/sysconfig/network directory, when I use the command 'ifdown ib0', and from the message shown, it will pick up the ib1's configuration, then I use the command 'ifup ib0', it picks up the ib1's configuration again, and assigned the IPoIB for ib1 to ib0. The following is the command I used, and message showed: #ifdown ib0 ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex HCA (Tavor compatibility mode) (rev a0) ib0 configuration: ib1 Any hints will be helpful. Thanks in advance, Harris From sean.hefty at intel.com Tue Feb 7 17:01:36 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 7 Feb 2006 17:01:36 -0800 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support In-Reply-To: <20060207074133.GC14165@mellanox.co.il> Message-ID: >-static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) >+static inline void *get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) > { >- return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + >- (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * >- (mad_send_wr->seg_num - 1); >+ struct ib_mad_multipacket_seg *seg; >+ int i = 2; >+ >+ list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { >+ if (i == mad_send_wr->seg_num) >+ return seg->data; >+ i++; >+ } >+ return NULL; > } It would be more efficient if the payload segments were stored in an array. We're going to end up walking through the list N times in order to send N segments. 
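The array-based lookup suggested above can be sketched in plain userspace C. This is only an illustration, not the patch under review: the names mad_segments and alloc_segments are made up for the example, calloc() stands in for kzalloc(), and for simplicity the pointer table and all payload storage come from one allocation rather than one allocation per n segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of the proposed ib_mad_segments layout. */
struct mad_segments {
	uint32_t num_segments;
	uint32_t segment_size;
	void *segment[];	/* one pointer per RMPP segment after the first */
};

/* Allocate the pointer table and all payload storage in a single block. */
static struct mad_segments *alloc_segments(uint32_t num, uint32_t size)
{
	struct mad_segments *segs;
	unsigned char *payload;
	uint32_t i;

	segs = calloc(1, sizeof(*segs) + num * sizeof(void *) +
			 (size_t)num * size);
	if (!segs)
		return NULL;

	segs->num_segments = num;
	segs->segment_size = size;
	/* Payload storage starts right after the pointer table. */
	payload = (unsigned char *)&segs->segment[num];
	for (i = 0; i < num; i++)
		segs->segment[i] = payload + (size_t)i * size;
	return segs;
}

/*
 * seg_num counts from 1. Segment 1 lives in the first MAD buffer, so the
 * table starts at segment 2. The lookup is O(1) instead of a list walk.
 */
static void *get_seg_addr(const struct mad_segments *segs, uint32_t seg_num)
{
	assert(segs != NULL);
	if (seg_num < 2 || seg_num - 2 >= segs->num_segments)
		return NULL;
	return segs->segment[seg_num - 2];
}
```

With a table like this, sending N segments costs N constant-time lookups, and freeing the whole set is a single free() on the block.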
>+ > struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, > u32 remote_qpn, u16 pkey_index, > int rmpp_active, >@@ -787,39 +798,38 @@ struct ib_mad_send_buf * ib_create_send_ > { > struct ib_mad_agent_private *mad_agent_priv; > struct ib_mad_send_wr_private *mad_send_wr; >- int length, buf_size; >+ int length, message_size, seg_size; > void *buf; > > mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, > agent); >- buf_size = get_buf_length(hdr_len, data_len); >+ message_size = get_buf_length(hdr_len, data_len); > > if ((!mad_agent->rmpp_version && >- (rmpp_active || buf_size > sizeof(struct ib_mad))) || >- (!rmpp_active && buf_size > sizeof(struct ib_mad))) >+ (rmpp_active || message_size > sizeof(struct ib_mad))) || >+ (!rmpp_active && message_size > sizeof(struct ib_mad))) > return ERR_PTR(-EINVAL); > >- length = sizeof *mad_send_wr + buf_size; >- if (length >= PAGE_SIZE) >- buf = (void *)__get_free_pages(gfp_mask, >long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); >- else >- buf = kmalloc(length, gfp_mask); >+ length = sizeof *mad_send_wr + message_size; >+ buf = kzalloc(sizeof *mad_send_wr + sizeof(struct ib_mad), gfp_mask); > > if (!buf) > return ERR_PTR(-ENOMEM); > >- memset(buf, 0, length); >- >- mad_send_wr = buf + buf_size; >+ mad_send_wr = buf + sizeof(struct ib_mad); >+ INIT_LIST_HEAD(&mad_send_wr->multipacket_list); > mad_send_wr->send_buf.mad = buf; >+ mad_send_wr->send_buf.mad_payload = buf + hdr_len; > > mad_send_wr->mad_agent_priv = mad_agent_priv; >- mad_send_wr->sg_list[0].length = buf_size; >+ mad_send_wr->sg_list[0].length = hdr_len; > mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey; >+ mad_send_wr->sg_list[1].length = sizeof(struct ib_mad) - hdr_len; >+ mad_send_wr->sg_list[1].lkey = mad_agent->mr->lkey; The common case will be for single segment MADs. We should try to pass a single SGE to the hardware in this case, rather than requiring the hardware to fetch two entries. 
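One way to read the single-SGE suggestion above is to pick the number of scatter/gather entries from the message shape. The sketch below is a userspace illustration under stated assumptions: struct sge is a simplified stand-in for the kernel's struct ib_sge, fill_sg_list and its multi_segment flag are hypothetical, and MAD_SIZE assumes the usual 256-byte MAD.

```c
#include <assert.h>
#include <stdint.h>

#define MAD_SIZE 256u	/* assumed size of one MAD */

/* Simplified stand-in for struct ib_sge. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Fill sg[] for the first (or only) MAD and return the number of entries
 * used. A single-segment MAD gets one entry covering header and payload;
 * only a multi-segment send splits header and payload into two entries.
 */
static int fill_sg_list(struct sge sg[2], uint64_t mad_addr, uint32_t hdr_len,
			uint32_t lkey, int multi_segment)
{
	assert(hdr_len < MAD_SIZE);

	if (!multi_segment) {
		sg[0].addr = mad_addr;
		sg[0].length = MAD_SIZE;
		sg[0].lkey = lkey;
		return 1;
	}

	sg[0].addr = mad_addr;
	sg[0].length = hdr_len;
	sg[0].lkey = lkey;
	sg[1].addr = mad_addr + hdr_len;
	sg[1].length = MAD_SIZE - hdr_len;
	sg[1].lkey = lkey;
	return 2;
}
```

The return value would feed send_wr.num_sge, so the common single-segment path keeps the one-entry work request it had before the patch.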
> > mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr; > mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; >- mad_send_wr->send_wr.num_sge = 1; >+ mad_send_wr->send_wr.num_sge = 2; > mad_send_wr->send_wr.opcode = IB_WR_SEND; > mad_send_wr->send_wr.send_flags = IB_SEND_SIGNALED; > mad_send_wr->send_wr.wr.ud.remote_qpn = remote_qpn; >@@ -827,6 +837,7 @@ struct ib_mad_send_buf * ib_create_send_ > mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; > > if (rmpp_active) { >+ struct ib_mad_multipacket_seg *seg; > struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad; > rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - > IB_MGMT_RMPP_HDR + data_len); >@@ -834,6 +845,27 @@ struct ib_mad_send_buf * ib_create_send_ > rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; > ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, > IB_MGMT_RMPP_FLAG_ACTIVE); >+ mad_send_wr->total_length = message_size; >+ /* allocate RMPP buffers */ >+ message_size -= sizeof(struct ib_mad); >+ seg_size = sizeof(struct ib_mad) - hdr_len; >+ while (message_size > 0) { >+ seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + >+ seg_size, gfp_mask); >+ if (!seg) { >+ printk(KERN_ERR "ib_create_send_mad: RMPP mem " >+ "alloc failed for len %zd, gfp %#x\n", >+ sizeof(struct ib_mad_multipacket_seg) + >+ seg_size, gfp_mask); >+ free_send_multipacket_list(mad_send_wr); >+ kfree(buf); >+ return ERR_PTR(-ENOMEM); >+ } >+ seg->size = seg_size; >+ list_add_tail(&seg->list, >+ &mad_send_wr->multipacket_list); >+ message_size -= seg_size; >+ } > } > > mad_send_wr->send_buf.mad_agent = mad_agent; >@@ -842,23 +874,36 @@ struct ib_mad_send_buf * ib_create_send_ > } > EXPORT_SYMBOL(ib_create_send_mad); This function is getting fairly long. Can we split it up? 
>+struct ib_mad_multipacket_seg *ib_get_multipacket_seg(struct ib_mad_send_buf * >+ send_buf, int seg_num) >+{ >+ struct ib_mad_send_wr_private *mad_send_wr; >+ struct ib_mad_multipacket_seg *seg; >+ int i = 2; >+ >+ mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, >+ send_buf); >+ list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { >+ if (i == seg_num) >+ return seg; >+ i++; >+ } >+ return NULL; >+} >+EXPORT_SYMBOL(ib_get_multipacket_seg); Same list walking issue. - Sean From rjwalsh at pathscale.com Tue Feb 7 18:30:03 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 07 Feb 2006 18:30:03 -0800 Subject: [openib-general] Pathscale driver build broken in SVN5330 In-Reply-To: References: <1AC79F16F5C5284499BB9591B33D6F0006D0CCC8@orsmsx408> Message-ID: <1139365803.13833.2.camel@phosphene.durables.org> On Tue, 2006-02-07 at 09:29 -0800, Roland Dreier wrote: > Thanks, I broke this when I merged Or's FMR patch. > I checked in this fix: > > --- infiniband/hw/ipath/ipath_verbs.c (revision 5330) > +++ infiniband/hw/ipath/ipath_verbs.c (working copy) > @@ -5756,7 +5756,7 @@ static struct ib_fmr *ipath_alloc_fmr(st > fmr->mr.offset = 0; > fmr->mr.access_flags = mr_access_flags; > fmr->mr.max_segs = fmr_attr->max_pages; > - fmr->page_size = fmr_attr->page_size; > + fmr->page_size = fmr_attr->page_shift; > return &fmr->ibfmr; > } Thanks, Roland. I followed up with a fix to change our private struct member to page_shift, too, just to be consistent. BTW: we should have a new driver drop shortly. This won't be one suitable for submission to the kernel, but it will match our newly shipped InfiniPath software (1.2, which went live today.) We're still working on incorporating all the feedback from lkml. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. 
From rdreier at cisco.com Tue Feb 7 15:59:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 15:59:55 -0800 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> (Ranjit Pandit's message of "Tue, 7 Feb 2006 14:39:00 -0800") References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> Message-ID: Ranjit> Commenting out call to mthca_reset() in mthca_main.c Ranjit> worked around the problem on my system, and as far as I Ranjit> can tell, did not have any negative impact. Yes, that should work fine in most cases. The reset is done to get the HCA into a known state, since it might have been initialized by a boot ROM or a previous driver load. Ranjit> It will be good if someone reviews the reset path in mthca. It's hard to see what software bug could be causing this issue. We reset the chip, and it never comes back -- reads to the PCI config header return 0xffffffff for more than 10 seconds. - R. From rdreier at cisco.com Tue Feb 7 16:01:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Feb 2006 16:01:07 -0800 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> (Michael Di Domenico's message of "Tue, 7 Feb 2006 18:59:08 -0500") References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> Message-ID: Michael> What specifically would you like to know? What kind of CPU, motherboard, PCI host bridge, etc. "lspci -vvv" output would be interesting. Maybe /proc/cpuinfo too. - R. 
From yael at mellanox.co.il Wed Feb 8 00:30:22 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 08 Feb 2006 10:30:22 +0200 Subject: [openib-general] [PATCH] Opensm - osm_mcast_mgr.c add type casting Message-ID: <5z7j86l0bl.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds a missing type casting in the return value of the function osm_mcast_mgr_compute_max_hops. Thanks, Yael Signed-off-by: Yael Kalka Index: osm_mcast_mgr.c =================================================================== --- osm_mcast_mgr.c (revision 5307) +++ osm_mcast_mgr.c (working copy) @@ -269,7 +269,7 @@ osm_mcast_mgr_compute_max_hops( } OSM_LOG_EXIT( p_mgr->p_log ); - return( max_hops ); + return(float)(max_hops); } /********************************************************************** From yael at mellanox.co.il Wed Feb 8 00:50:19 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 08 Feb 2006 10:50:19 +0200 Subject: [openib-general] [PATCH] Opensm - type changing in st.h/c files Message-ID: <5z64nqkzec.fsf@mtl066.yok.mtl.com> Hi Hal, There was a problem with some of the types defined when compiling on 64bit windows machines. The following patch adds support for these as well. 
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/st.h =================================================================== --- include/opensm/st.h (revision 5307) +++ include/opensm/st.h (working copy) @@ -50,14 +50,21 @@ BEGIN_C_DECLS -typedef unsigned long st_data_t; +#if (__WORDSIZE == 64) || defined (_WIN64) +#define st_ptr_t unsigned long long +#else +#define st_ptr_t unsigned long +#endif + +typedef st_ptr_t st_data_t; + #define ST_DATA_T_DEFINED typedef struct st_table st_table; struct st_hash_type { int (*compare)(void *, void *); - int (*hash)(void *); + st_ptr_t (*hash)(void *); }; struct st_table { Index: opensm/st.c =================================================================== --- opensm/st.c (revision 5307) +++ opensm/st.c (working copy) @@ -41,7 +41,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include "config.h" #include #include #include @@ -73,7 +72,7 @@ struct st_table_entry { * */ static int numcmp(void *, void *); -static int numhash(void *); +static st_ptr_t numhash(void *); static struct st_hash_type type_numhash = { numcmp, numhash, @@ -83,7 +82,7 @@ static struct st_hash_type type_numhash /* extern int strcmp(const char *, const char *); */ static int strhash(const char *); -static inline int st_strhash(void *key) +static inline st_ptr_t st_strhash(void *key) { return strhash((const char *)key); } @@ -619,12 +618,12 @@ static int numcmp(x, y) void *x, *y; { - return (long)x != (long)y; + return (st_ptr_t)x != (st_ptr_t)y; } -static int +static st_ptr_t numhash(n) void *n; { - return (long)n; + return (st_ptr_t)n; } From mst at mellanox.co.il Wed Feb 8 02:40:43 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 8 Feb 2006 12:40:43 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> References: <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> Message-ID: <20060208104043.GF28594@mellanox.co.il> I wonder whether we manage to locate the bridge. It would be interesting to build mthca with debug enabled. Quoting r. Michael Di Domenico : > > What specifically would you like to know? > > On 2/7/06, Roland Dreier wrote: > > > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device did not come back after reset, aborting. > > > > Can you give more details on the system where you saw this? > > > > - R. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 02:44:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 12:44:02 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> References: <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> Message-ID: <20060208104402.GG28594@mellanox.co.il> If you really suspect timing issues, you can always increase timeouts: look for msleep in mthca_reset.c and try bumping up the numbers. Anyway - could you please enable mthca debug in menuconfig? This would give us some more information on what's going on. Quoting r. Ranjit Pandit : > Subject: Re: openib and mellanox hca problem > > Michael, > > I have seen this problem before.. > See following mail thread > > http://www.mail-archive.com/openib-general at openib.org/msg13861.html > > Commenting out call to mthca_reset() in mthca_main.c worked around the > problem on my system, and as far as I can tell, did not have any > negative impact. > > It will be good if someone reviews the reset path in mthca.
> > Ranjit > > > On 2/7/06, Michael Di Domenico wrote: > > I'm trying to build a system using the openib drivers with a mellanox > > hca card. I don't have much information about the card itself, it's > > in a server right now... > > > > But I downloaded openib today from the svn source, installed it onto a > > fresh copy of Fedora Core 4 with Kernel version 2.6.15.3... > > Everything seemed to compile fine and install okay. I've been > > following the instructions from the wiki page thus far without a > > problem. I get upto this step > > > > modprobe ib_mthca > > > > and get the below error in /var/log/messages. Strangely enough all > > the modules load, and i do a udevstart, but i never get a > > /dev/infiniband directory and /sys/class/infiniband directory is > > empty. > > > > Does anyone know how i might fix this, or point me to some better > > documentation then what is on the wiki? > > > > Thanks > > - Michael > > > > > > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA > > driver v0.06 (June 23, 2005) > > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Initializing 0000:07:00.0 > > Feb 7 16:59:37 linux14-ts kernel: ACPI: PCI Interrupt 0000:07:00.0[?] > > -> GSI 26 (level, low) -> IRQ 217 > > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device > > did not come back after reset, aborting. > > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: Failed to > > reset HCA, aborting. > > Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device > > 0000:07:00.0 disabled > > > > > > --- lspci output > > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) > > (prog-if ff) > > !!! Unknown header type 7f > > > > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) > > (prog-if ff) > > !!! Unknown header type 7f -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Wed Feb 8 03:16:25 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 8 Feb 2006 13:16:25 +0200 (IST) Subject: [openib-general] iser: cleanups changeset Message-ID: kind of huge cleanup as part of the preparations for the RFC iscsi_iser.h | 166 +++++++++++++++++-------------------------------------- iser_initiator.c | 69 ---------------------- iser_memory.c | 138 +++++++++++++++++++++++++-------------------- iser_verbs.c | 52 ++++++++--------- 4 files changed, 156 insertions(+), 269 deletions(-) ------------------------------------------------------------------------ r5336 | ogerlitz | 2006-02-08 13:13:17 +0200 (Wed, 08 Feb 2006) | 4 lines cleanps Signed-off-by: Or Gerlitz
From grave at ipno.in2p3.fr Wed Feb 8 04:39:27 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Wed, 08 Feb 2006 13:39:27 +0100 Subject: [openib-general] trying to run cmpost example Message-ID: <1139402367.8528.14.camel@ipnnarval> Hi all, one more newbie question. Here is my ib modules installation (2.6.15 kernel from ftp.kernel.org) lsmod | grep ib ib_umad 26472 0 ib_ucm 31992 0 ib_cm 50648 1 ib_ucm ib_mthca 156244 0 ib_uverbs 57968 0 ib_ipoib 61736 0 ib_sa 24568 1 ib_ipoib ib_mad 56548 4 ib_umad,ib_cm,ib_mthca,ib_sa ib_core 71344 8 ib_umad,ib_ucm,ib_cm,ib_mthca,ib_uverbs,ib_ipoib,ib_sa,ib_mad I run cmpost from libibcm/example directory as root ls -la /dev/infiniband/ucm0 gives : crw-r--r-- 1 root root 231, 255 2006-02-08 13:28 /dev/infiniband/ucm0 Prompt> LD_LIBRARY_PATH=/usr/local/lib ./cmpost libibcm: error <-1:6> opening device starting server listen request failed test complete Does somebody have an idea of what is missing ? All my lib code comes from the svn repository, do I need to modify the 2.6.15 infiniband directory ? Thanks in advance, xavier From mst at mellanox.co.il Wed Feb 8 04:45:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 14:45:44 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208124544.GL28594@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib_mcast_send.patch > > Michael> I agree. Do you want to fix it or should I? > > If you get a chance that would be great.
I'm at the OpenIB workshop
> now so I probably can't seriously look at it until tomorrow at the
> earliest.

Here you are. The following is in ipoib_broadcast_gid.patch in svn.

---

The way priv->broadcast is initialized in ipoib_mcast_join_task() is
somewhat unsafe, since there's no lock and conceivably a send-only join
could complete before priv->broadcast is fully set up.

Signed-off-by: Michael S. Tsirkin

Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5336)
+++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
@@ -533,8 +533,9 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +545,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		priv->broadcast = broadcast;
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}

-- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From mdidomenico at gmail.com Wed Feb 8 06:00:09 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 09:00:09 -0500 Subject: [openib-general] openib and mellanox hca problem In-Reply-To: References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> Message-ID: <97a7c7ed0602080600s373d5c04ie6733715284bf23e@mail.gmail.com> Roland, I've attached the dmesg and lspci outputs... >From what i know, it's an APPRO 1224xi server w/ a Tyan S2721 motherboard w/ intel chipsets. I'd have to crack the case open to get more detail... On 2/7/06, Roland Dreier wrote: > Michael> What specifically would you like to know? > > What kind of CPU, motherboard, PCI host bridge, etc. > > "lspci -vvv" output would be interesting. Maybe /proc/cpuinfo too. > > - R. > -------------- next part -------------- A non-text attachment was scrubbed... Name: lspci.out Type: application/octet-stream Size: 13795 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: dmesg.out Type: application/octet-stream Size: 16088 bytes Desc: not available URL: From mst at mellanox.co.il Wed Feb 8 06:13:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 16:13:33 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> Message-ID: <20060208141333.GP28594@mellanox.co.il> Quoting Michael Di Domenico : > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device > did not come back after reset, aborting. > Feb 7 16:59:48 linux14-ts kernel: ib_mthca 0000:07:00.0: Failed to > reset HCA, aborting. 
> Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device > 0000:07:00.0 disabled > > > --- lspci output > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) > (prog-if ff) > !!! Unknown header type 7f > > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) > (prog-if ff) > !!! Unknown header type 7f This could be a hardware problem. Please contact your mellanox FAE representative. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mdidomenico at gmail.com Wed Feb 8 06:20:19 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 09:20:19 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208141333.GP28594@mellanox.co.il> References: <97a7c7ed0602071410p1b92cc3etd032fbd5a95243ee@mail.gmail.com> <20060208141333.GP28594@mellanox.co.il> Message-ID: <97a7c7ed0602080620w1e3fa7c3t3ac21b477c13b53a@mail.gmail.com> On 2/8/06, Michael S. Tsirkin wrote: > Quoting Michael Di Domenico : > > > > --- lspci output > > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) > > (prog-if ff) > > !!! Unknown header type 7f > > > > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) > > (prog-if ff) > > !!! Unknown header type 7f > > This could be a hardware problem. Please contact your mellanox FAE > representative. > It shouldn't be. These machines were working fine with a copy of REL3 using a 2.4 kernel and the silverstorm hca stack. 
This has only creeped up when i switched to Fedora Core v4 v2.6 kernel and the openib stack From mdidomenico at gmail.com Wed Feb 8 06:24:25 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 09:24:25 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208104402.GG28594@mellanox.co.il> References: <96f8e60e0602071439n13988e6cm92ee7b66c1e19f75@mail.gmail.com> <20060208104402.GG28594@mellanox.co.il> Message-ID: <97a7c7ed0602080624g24f0bf6fi93d54a8c6d2d8022@mail.gmail.com> On 2/8/06, Michael S. Tsirkin wrote: > If you really suspect timing issues, you can always > increase timeouts: look for msleep in mthca_reset.c and try bumping up > the numbers. > > Anyway - could you please enable mthca debug in menuconfig? > This would give us some more information on whats going on. I enabled debug in the module config recompiled and tried to reload using modprobe ib_mthca and got the same results? Am i missing a debug parameter somewhere? Or should it just spit out more information automatically? From Arkady.Kanevsky at netapp.com Wed Feb 8 06:24:30 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 8 Feb 2006 09:24:30 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: One more issue to discuss. Does Completion of Recv that matches RDMA Write with Immediate Data automatically sync local memory or Consumer still need to do lmr_sync_rdma_write prior to accessing RDMAed data. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. 
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, February 07, 2006 7:40 PM > To: dat-discussions at yahoogroups.com; Larsen, Roy K; Arlin > Davis; Hefty, Sean > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > dat-discussions at yahoogroups.com wrote: > > We have problem no matter which option we choose. > > The current Transport Level Requirement state: > > > > There is a one-to-one correspondence between send operation on one > > Endpoint of the Connection and recv operations on the other > Endpoint > > of the Connection. > > There is no correspondence between RDMA operations on one > Endpoint of > > the Connection and recv or send data transfer operation on > the other > > Endpoint of the Connection. > > Receive operations on a Connection must be completed in the > order of > > posting of their corresponding sends. > > > > The Immediate data and Atomic ops violate these > requirements including > > ordering rules. > > > > I had started updating these rules when I generated the > first draft of > > the requirements. They are included in the enclosed pdf file. > > But they do not cover Atomic ops that also impact transport > > requirements. This chapter of the spec have not been changed since > > DAPL 1.0 and I am very concern with any changes to it. > > > > Arkady > > > > If "RDMA Write with Immediate" is viewed as being the > equivalent of doing RDMA Write and then an RDMA Send the > correspondence rule is maintained. But *only* if the "rdma > write with immediate" > has all of the semantics of a Send. > > Atomics do not violate the rules if you view them as being a > variation on an RDMA Read. They are an RDMA Read with modify. > The real question is whether it makes sense to put it in the > RDMA device. It is also not subject to emulation at a highe layer. 
> > With send with invalidate we know how InfiniBand *will* > support it, because of the IB 1.2 verbs. We do not know that > for atomics over iWARP. We do not know whether it will be > added, more importantly we do not know *how* it would be > added if it were added. That makes coming up with a transport > neutral definition very premature. > In particular, if atomics were added to iWARP there is a > distinct design option where it would *not* be the same work > queue as RDMA Reads (adding atomics through Queue ID 3 would > make layering on top of a current implementation much easier. > But it would mean that atomic credits would be distinct from > read credits. This is a very strong reason to defer > attempting to define RDMA Atomics in a transport neutral fashion. > > > > > > > > Yahoo! Groups Links > > <*> To visit your group on the web, go to: > http://groups.yahoo.com/group/dat-discussions/ > > <*> To unsubscribe from this group, send an email to: > dat-discussions-unsubscribe at yahoogroups.com > > <*> Your use of Yahoo! Groups is subject to: > http://docs.yahoo.com/info/terms/ > > > From mst at mellanox.co.il Wed Feb 8 06:49:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 16:49:25 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602080624g24f0bf6fi93d54a8c6d2d8022@mail.gmail.com> References: <97a7c7ed0602080624g24f0bf6fi93d54a8c6d2d8022@mail.gmail.com> Message-ID: <20060208144924.GQ28594@mellanox.co.il> Quoting r. Michael Di Domenico : > Subject: Re: openib and mellanox hca problem > > On 2/8/06, Michael S. Tsirkin wrote: > > If you really suspect timing issues, you can always > > increase timeouts: look for msleep in mthca_reset.c and try bumping up > > the numbers. > > > > Anyway - could you please enable mthca debug in menuconfig? > > This would give us some more information on whats going on. 
> > I enabled debug in the module config recompiled and tried to reload > using modprobe ib_mthca and got the same results? Am i missing a > debug parameter somewhere? Or should it just spit out more > information automatically? Yes, it should spit out things like "Found bridge". Are you sure you installed it properly? To check, you can try to stick mthca_dbg(mdev, "Here\n"); at the beginning of mthca_reset and see that it gets printed. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Wed Feb 8 07:00:41 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 8 Feb 2006 07:00:41 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DB09@NT-SJCA-0751.brcm.ad.broadcom.com> dat-discussions at yahoogroups.com wrote: > One more issue to discuss. > Does Completion of Recv that matches RDMA Write with > Immediate Data automatically sync local memory or Consumer > still need to do lmr_sync_rdma_write prior to accessing RDMAed data. > Why would it be any different than for a plain receive? The intent is the same, to indicate that prior Writes have completed. From mdidomenico at gmail.com Wed Feb 8 07:03:28 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 10:03:28 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208144924.GQ28594@mellanox.co.il> References: <97a7c7ed0602080624g24f0bf6fi93d54a8c6d2d8022@mail.gmail.com> <20060208144924.GQ28594@mellanox.co.il> Message-ID: <97a7c7ed0602080703w3fa9b184x2aec5e84ae64c560@mail.gmail.com> On 2/8/06, Michael S. Tsirkin wrote: > Quoting r. Michael Di Domenico : > > Subject: Re: openib and mellanox hca problem > > > > On 2/8/06, Michael S. Tsirkin wrote: > > > If you really suspect timing issues, you can always > > > increase timeouts: look for msleep in mthca_reset.c and try bumping up > > > the numbers. 
> > > > > > Anyway - could you please enable mthca debug in menuconfig? > > > This would give us some more information on whats going on. > > > > I enabled debug in the module config recompiled and tried to reload > > using modprobe ib_mthca and got the same results? Am i missing a > > debug parameter somewhere? Or should it just spit out more > > information automatically? > > Yes, it should spit out things like "Found bridge". > Are you sure you installed it properly? > > To check, you can try to stick mthca_dbg(mdev, "Here\n"); at the beginning of > mthca_reset and see that it gets printed. definately working... Feb 8 10:01:23 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) Feb 8 10:01:23 linux14-ts kernel: ib_mthca: Initializing 0000:07:00.0 Feb 8 10:01:23 linux14-ts kernel: ACPI: PCI Interrupt 0000:07:00.0[?] -> GSI 26 (level, low) -> IRQ 217 Feb 8 10:01:23 linux14-ts kernel: ib_mthca 0000:07:00.0: Here Feb 8 10:01:23 linux14-ts kernel: ib_mthca 0000:07:00.0: Found bridge: 0000:06:03.0 Feb 8 10:01:34 linux14-ts kernel: ib_mthca 0000:07:00.0: PCI device did not come back after reset, aborting. Feb 8 10:01:34 linux14-ts kernel: ib_mthca 0000:07:00.0: Failed to reset HCA, aborting. Feb 8 10:01:34 linux14-ts kernel: ACPI: PCI interrupt for device 0000:07:00.0 disabled
From mst at mellanox.co.il Wed Feb 8 07:23:42 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 8 Feb 2006 17:23:42 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602080600s373d5c04ie6733715284bf23e@mail.gmail.com> References: <97a7c7ed0602080600s373d5c04ie6733715284bf23e@mail.gmail.com> Message-ID: <20060208152342.GS28594@mellanox.co.il> Quoting r. Michael Di Domenico : > Subject: Re: openib and mellanox hca problem > > Roland, > > I've attached the dmesg and lspci outputs... You really want lspci *before* mthca got loaded. This one just shows the card's incommunicado. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 07:26:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 17:26:03 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602080703w3fa9b184x2aec5e84ae64c560@mail.gmail.com> References: <97a7c7ed0602080703w3fa9b184x2aec5e84ae64c560@mail.gmail.com> Message-ID: <20060208152603.GT28594@mellanox.co.il> Quoting Michael Di Domenico : > Feb 8 10:01:23 linux14-ts kernel: ib_mthca 0000:07:00.0: Found > bridge: 0000:06:03.0 Hmm, looks like the bridge lookup worked fine. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From jackm at mellanox.co.il Wed Feb 8 08:28:39 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 8 Feb 2006 18:28:39 +0200 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CCDF5@mtlexch01.mtl.com> Sorry for breaking the thread (Outlook is problematic). Jack -----Original Message----- From: Jack Morgenstein Sent: Wednesday, February 08, 2006 6:23 PM To: 'Sean Hefty' Cc: Michael S. Tsirkin; 'rolandd at cisco.com' Subject: RE: [PATCH 1 of 3] mad: large RMPP support Sorry for not echoing to openib -- I'm having problems with mutt and our server (replying to this from Outlook will not place the reply in the thread). I would much rather use the linked list. 
We may need to allocate a rather large contiguous array (ib_mad_segments segment array) for queries involving a large cluster, and such an allocation has a larger probability of failure. For example, a 1000 host cluster, with 2 ports per HCA will have at least 4000 records in a SubnAdmGetTableResp for all PortInfo records on the network (2000 for HCAs, and at least 2000 for the switch ports). Such a query response will generate an RMPP of size 256K -- 1000 segments, or a 4K buffer on an X86 machine just for the array (assuming one allocation per RMPP segment -- N=1). b. Regarding using buffers which contain N RMPP segments, this becomes a management nightmare: If we choose N too large, we may fail to allocate segments in a large RMPP, so that the entire RMPP fails (where it could succeed if N=1). Having N=1 guarantees that if we can succeed in our allocation, we will. I do not consider variable-size N within a single RMPP, since this will be very complicated and error-prone. We could re-allocate everything if some N does not work -- also very complex. Regarding the order N-squared algorithm for finding the next RMPP segment to send, MST and I agree that this is not acceptable. We are considering an algorithm which stores the current segment pointer in "struct ib_mad_send_wr_private" so that when getting the next segment we simply go to the "next" link. We're still ironing out proper handling of the "last acknowledged" processing (maintaining a pointer to the last-acked segment, upgrading the last-acked pointer when a new ack arrives -- this might still involve linear searches). Regarding the payload pointer, I agree. It is also trivial to move it to the ib_mad_send_wr_private structure, hiding it from the user. Regarding the 64-byte boundary, why is this important?
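
A minimal sketch of the cursor idea described above, not code from the patch: the list keeps a pointer to the next segment to send, so fetching it is O(1) instead of a walk from the head, and the last-acknowledged pointer only ever walks forward from its previous position. All names here (seg, seg_list, seg_list_next, seg_list_ack) are hypothetical and are not the ib_mad structures under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the ib_mad structures. */
struct seg {
	struct seg *next;
	unsigned int num;	/* 1-based segment number */
};

struct seg_list {
	struct seg *head;
	struct seg *tail;
	struct seg *cur;	/* next segment to send, NULL if none left */
	struct seg *last_ack;	/* last acknowledged segment, NULL if none */
};

static void seg_list_init(struct seg_list *l)
{
	l->head = l->tail = l->cur = l->last_ack = NULL;
}

/* Append one segment (one allocation per segment, i.e. N=1).
 * Returns 0 on success, -1 if the allocation fails. */
static int seg_list_append(struct seg_list *l)
{
	struct seg *s;

	assert(l != NULL);
	s = malloc(sizeof *s);
	if (!s)
		return -1;
	s->next = NULL;
	s->num = l->tail ? l->tail->num + 1 : 1;
	if (l->tail)
		l->tail->next = s;
	else
		l->head = s;
	l->tail = s;
	if (!l->cur)
		l->cur = s;	/* nothing pending, so this is next to send */
	return 0;
}

/* Return the next segment to send and advance the cursor: O(1). */
static struct seg *seg_list_next(struct seg_list *l)
{
	struct seg *s = l->cur;

	if (s)
		l->cur = s->next;
	return s;
}

/* Move last_ack to segment `num`, walking forward from the previous ack.
 * Returns 0 on success, -1 if `num` is stale or beyond the list. */
static int seg_list_ack(struct seg_list *l, unsigned int num)
{
	struct seg *s = l->last_ack ? l->last_ack : l->head;

	while (s && s->num < num)
		s = s->next;
	if (!s || s->num != num)
		return -1;
	l->last_ack = s;
	return 0;
}

/* Free every segment and reset the list to empty. */
static void seg_list_free(struct seg_list *l)
{
	struct seg *s = l->head;

	while (s) {
		struct seg *next = s->next;

		free(s);
		s = next;
	}
	seg_list_init(l);
}
```

Because each ack resumes from the previous one, the total work across all acks of a message is linear in the number of segments, although a single ack can still walk several links, which matches the "might still involve linear searches" caveat.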
Jack

-----Original Message-----
From: Sean Hefty [mailto:sean.hefty at intel.com]
Sent: Wednesday, February 08, 2006 3:01 AM
To: Jack Morgenstein
Cc: openib-general at openib.org
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Based on what you've done, I'd like to suggest changing the interface to
something similar to that shown below. I believe that this could be done with
minor changes to the current patches. Detailed comments that led to suggesting
this change are inline in my responses.

struct ib_mad_segments {
	u32 num_segments;
	u32 segment_size;
	void *segment[0];
};

struct ib_mad_send_buf {
	...
	void *mad;				/* First MAD segment */
	struct ib_mad_segments *segments;	/* RMPP segments > 1 */
	...
};

This will avoid walking through a list to find segments, and allows for
efficient allocation of the segment data buffers. Multiple segments could be
allocated through a single kzalloc. (For example, every n-th segment would
start a new allocation, making deallocation easy as well.)

>+struct ib_mad_multipacket_seg {
>+	struct list_head list;
>+	u32 size;
>+	u8 data[0];
>+};

Should we ensure that the data alignment is on a 64-byte boundary?

> struct ib_mad_send_buf {
> 	struct ib_mad_send_buf *next;
>-	void *mad;
>+	void *mad;		/* RMPP: first segment,
>+				   including the MAD header */
>+	void *mad_payload;	/* RMPP: changed per segment */

Mad_payload doesn't appear to be directly accessible by the user. It should be
hidden.

- Sean

From swise at opengridcomputing.com Wed Feb 8 08:27:22 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 08 Feb 2006 10:27:22 -0600
Subject: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Message-ID: <1139416042.26808.14.camel@stevo-desktop>

All,

Attached is a user-mode program, called rping, that uses librdmacm and libibverbs to implement a ping-pong program over an RC connection.
The program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq channels to get cq events, and rdma_get_event() to detect CMA events. It is multi-threaded. I've built it as an example program in librdmacm/examples and tested it with mthca. It is useful to test CMA as well as all the major rdma operations in a transport-neutral way. If you all find it has utility, please pull it into librdmacm/examples. Signed-off-by: Steve Wise Index: Makefile.am =================================================================== --- Makefile.am (revision 5330) +++ Makefile.am (working copy) @@ -18,9 +18,11 @@ src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose +bin_PROGRAMS = examples/ucmatose examples/rping examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la +examples_rping_SOURCES = examples/rping.c +examples_rping_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma Index: examples/rping.c =================================================================== --- examples/rping.c (revision 0) +++ examples/rping.c (revision 0) @@ -0,0 +1,1175 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +static int debug = 0; +#define DEBUG_LOG if (debug) printf + +/* + * rping "ping/pong" loop: + * client sends source rkey/addr/len + * server receives source rkey/add/len + * server rdma reads "ping" data from source + * server sends "go ahead" on rdma read completion + * client sends sink rkey/addr/len + * server receives sink rkey/addr/len + * server rdma writes "pong" data to sink + * server sends "go ahead" on rdma write completion + * + */ + +/* + * These states are used to signal events between the completion handler + * and the main client or server thread. + * + * Once CONNECTED, they cycle through RDMA_READ_ADV, RDMA_WRITE_ADV, + * and RDMA_WRITE_COMPLETE for each ping. + */ +typedef enum { + IDLE = 1, + CONNECT_REQUEST, + CONNECTED, + RDMA_READ_ADV, + RDMA_READ_COMPLETE, + RDMA_WRITE_ADV, + RDMA_WRITE_COMPLETE, + ERROR +} state_t; + +/* + * Default max buffer size for IO... + */ +#define RPING_BUFSIZE 64*1024 +#define RPING_SQ_DEPTH 16 + +/* + * Control block struct. 
+ */ +struct rping_cb { + int server; /* 0 iff client */ + pthread_t cqthread; + struct ibv_comp_channel *channel; + struct ibv_cq *cq; + struct ibv_pd *pd; + struct ibv_qp *qp; + + struct ibv_recv_wr rq_wr; /* recv work request record */ + struct ibv_sge recv_sgl; /* recv single SGE */ + char *recv_buf; /* malloc'd buffer */ + struct ibv_mr *recv_mr; /* MR associated with this buffer */ + + struct ibv_send_wr sq_wr; /* send work requrest record */ + struct ibv_sge send_sgl; + char *send_buf; /* single send buf */ + struct ibv_mr *send_mr; + + struct ibv_send_wr rdma_sq_wr; /* rdma work request record */ + struct ibv_sge rdma_sgl; /* rdma single SGE */ + char *rdma_buf; /* used as rdma sink */ + struct ibv_mr *rdma_mr; + + + uint32_t remote_rkey; /* remote guys RKEY */ + uint64_t remote_addr; /* remote guys TO */ + uint32_t remote_len; /* remote guys LEN */ + + char *start_buf; /* rdma read src */ + struct ibv_mr *start_mr; + + state_t state; /* used for cond/signalling */ + sem_t sem; + + uint16_t port; /* dst port in NBO */ + uint32_t addr; /* dst addr in NBO */ + char *addr_str; /* dst addr string */ + int verbose; /* verbose logging */ + int count; /* ping count */ + int size; /* ping data size */ + int validate; /* validate ping data */ + + /* CM stuff */ + pthread_t cmthread; + struct rdma_cm_id *cm_id; /* connection on client side,*/ + /* listener on service side. */ + struct rdma_cm_id *child_cm_id; /* connection on server side */ +}; + + +static void rping_cma_event_handler(struct rdma_cm_id *cma_id, + struct rdma_cm_event *event) +{ + int rc = 0; + struct rping_cb *cbp = (struct rping_cb *) cma_id->context; + + + DEBUG_LOG("cma_event type %d cma_id %p (%s)\n", + event->event, cma_id, + (cma_id == cbp->cm_id) ? 
"parent" : "child"); + switch (event->event) { + + case RDMA_CM_EVENT_ADDR_RESOLVED: + rc = rdma_resolve_route(cma_id, 2000); + if (rc) { + fprintf(stderr, "rdma_resolve_route error %d\n", rc); + cbp->state = ERROR; + sem_post(&cbp->sem); + } + break; + + case RDMA_CM_EVENT_ROUTE_RESOLVED: + sem_post(&cbp->sem); + break; + + case RDMA_CM_EVENT_CONNECT_REQUEST: + cbp->state = CONNECT_REQUEST; + cbp->child_cm_id = cma_id; + DEBUG_LOG("child cma %p\n", cbp->child_cm_id); + sem_post(&cbp->sem); + break; + + case RDMA_CM_EVENT_ESTABLISHED: + DEBUG_LOG("ESTABLISHED\n"); + cbp->state = CONNECTED; + sem_post(&cbp->sem); + break; + + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + fprintf(stderr, "cma event %d, error %d\n", event->event, + event->status); + cbp->state = ERROR; + sem_post(&cbp->sem); + break; + + case RDMA_CM_EVENT_DISCONNECTED: + fprintf(stderr, "DISCONNECT EVENT...\n"); + cbp->state = ERROR; + sem_post(&cbp->sem); + break; + + case RDMA_CM_EVENT_DEVICE_REMOVAL: + fprintf(stderr, "cma detected device removal!!!!\n"); + break; + + default: + fprintf(stderr, "oof bad type!\n"); + cbp->state = ERROR; + sem_post(&cbp->sem); + break; + + } + return; +} + +static void rping_cq_event_handler(struct rping_cb *cbp) +{ + struct ibv_wc wc; + struct ibv_recv_wr *bad_wr; + int rc; + + while ((rc = ibv_poll_cq(cbp->cq, 1, &wc)) == 1) { + if (wc.status) { + fprintf(stderr, "cq completion failed status %d\n", + wc.status); + cbp->state = ERROR; + sem_post(&cbp->sem); + return; + } + switch (wc.opcode) { + case IBV_WC_SEND: + DEBUG_LOG("send completion\n"); + break; + + case IBV_WC_RDMA_WRITE: + DEBUG_LOG("rdma write completion\n"); + cbp->state = RDMA_WRITE_COMPLETE; + sem_post(&cbp->sem); + break; + + case IBV_WC_RDMA_READ: + cbp->state = RDMA_READ_COMPLETE; + DEBUG_LOG("rdma read completion\n"); + sem_post(&cbp->sem); + break; + + case IBV_WC_RECV: + 
DEBUG_LOG("recv completion\n"); + if (cbp->server) { + if (wc.byte_len != 16) { + fprintf(stderr, + "Received bogus data, size %d\n", + wc.byte_len); + cbp->state = ERROR; + sem_post(&cbp->sem); + return; + } + cbp->remote_rkey = *((uint32_t *)cbp->recv_buf); + cbp->remote_addr = + *((uint64_t *) & cbp->recv_buf[4]); + cbp->remote_len = + *((uint32_t *) & cbp->recv_buf[12]); + DEBUG_LOG( + "Received rkey %x addr %llx " + "len %d from peer\n", + cbp->remote_rkey, cbp->remote_addr, + cbp->remote_len); + if (cbp->state == CONNECTED + || cbp->state == RDMA_WRITE_COMPLETE) + cbp->state = RDMA_READ_ADV; + else { + cbp->state = RDMA_WRITE_ADV; + } + } else { + if (wc.byte_len != 1) { + fprintf(stderr, + "Received bogus data, size %d\n", + wc.byte_len); + cbp->state = ERROR; + sem_post(&cbp->sem); + return; + } + if (cbp->state == RDMA_READ_ADV) { + cbp->state = RDMA_WRITE_ADV; + DEBUG_LOG("set state to WRITE_ADV\n"); + } else { + cbp->state = RDMA_WRITE_COMPLETE; + DEBUG_LOG("set state " + "to WRITE_COMPLETE\n"); + } + } + + /* + * post recv buf again + */ + rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr); + if (rc) { + cbp->state = ERROR; + } + sem_post(&cbp->sem); + break; + + default: + DEBUG_LOG("unknown!!!!! 
completion\n"); + break; + } + } + if (rc) { + fprintf(stderr, "poll error %d\n", rc); + exit(rc); + } +} + +static int rping_accept_cr(struct rping_cb *cbp, char *priv, int len) +{ + struct rdma_conn_param conn_param; + + DEBUG_LOG("accept_cr!\n"); + memset(&conn_param, 0, sizeof conn_param); + conn_param.private_data = priv; + conn_param.private_data_len = len; + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + return rdma_accept(cbp->child_cm_id, &conn_param); +} + +static int rping_connect(struct rping_cb *cbp, char *priv, int len) +{ + int rc; + struct rdma_conn_param conn_param; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.private_data = priv; + conn_param.private_data_len = len; + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + conn_param.retry_count = 10; + rc = rdma_connect(cbp->cm_id, &conn_param); + if (rc) { + fprintf(stderr, "rdma_connect error %d\n", rc); + cbp->state = ERROR; + sem_post(&cbp->sem); + } + return 0; +} + +static void rping_free_buffers(struct rping_cb *cbp) +{ + DEBUG_LOG("rping_free_buffers called on cbp %p\n", cbp); + ibv_dereg_mr(cbp->recv_mr); + free(cbp->recv_buf); + ibv_dereg_mr(cbp->send_mr); + free(cbp->send_buf); + ibv_dereg_mr(cbp->rdma_mr); + free(cbp->rdma_buf); + if (!cbp->server) { + ibv_dereg_mr(cbp->start_mr); + free(cbp->start_buf); + } +} + +static int rping_setup_buffers(struct rping_cb *cbp) +{ + struct ibv_recv_wr *bad_wr; + int rc; + + DEBUG_LOG("rping_setup_buffers called on cbp %p\n", cbp); + cbp->recv_buf = malloc(RPING_BUFSIZE); + if (cbp->recv_buf == NULL) { + return ENOMEM; + } + + cbp->recv_mr = ibv_reg_mr(cbp->pd, cbp->recv_buf, RPING_BUFSIZE, + IBV_ACCESS_LOCAL_WRITE); + if (!(cbp->recv_mr)) { + free(cbp->recv_buf); + cbp->recv_buf = NULL; + return errno; + } + + /* + * these never change + */ + cbp->recv_sgl.addr = (uint64_t) (unsigned long) cbp->recv_buf; + cbp->recv_sgl.length = RPING_BUFSIZE; + cbp->recv_sgl.lkey = cbp->recv_mr->lkey; + + 
/* + * these never change + */ + cbp->rq_wr.wr_id = (uint64_t) (unsigned long) &cbp->rq_wr; + cbp->rq_wr.sg_list = &cbp->recv_sgl; + cbp->rq_wr.num_sge = 1; + + cbp->send_buf = malloc(RPING_BUFSIZE); + if (cbp->send_buf == NULL) { + ibv_dereg_mr(cbp->recv_mr); + free(cbp->recv_buf); + cbp->recv_buf = NULL; + return ENOMEM; + } + + cbp->send_mr = ibv_reg_mr(cbp->pd, cbp->send_buf, RPING_BUFSIZE, 0); + if (!(cbp->send_mr)) { + ibv_dereg_mr(cbp->recv_mr); + free(cbp->recv_buf); + free(cbp->send_buf); + cbp->recv_buf = NULL; + return errno; + } + + /* + * these never change + */ + cbp->send_sgl.addr = (uint64_t) (unsigned long) cbp->send_buf; + cbp->send_sgl.lkey = cbp->send_mr->lkey; + + /* + * these never change + */ + cbp->sq_wr.opcode = IBV_WR_SEND; + cbp->sq_wr.wr_id = (uint64_t) (unsigned long) &cbp->sq_wr; + cbp->sq_wr.num_sge = 1; + cbp->sq_wr.sg_list = &cbp->send_sgl; + cbp->sq_wr.send_flags = IBV_SEND_SIGNALED; + + cbp->rdma_buf = malloc(RPING_BUFSIZE); + if (cbp->rdma_buf == NULL) { + ibv_dereg_mr(cbp->send_mr); + ibv_dereg_mr(cbp->recv_mr); + free(cbp->recv_buf); + free(cbp->send_buf); + cbp->recv_buf = NULL; + return ENOMEM; + } + + cbp->rdma_mr = ibv_reg_mr(cbp->pd, cbp->rdma_buf, RPING_BUFSIZE, + IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_WRITE); + if (!(cbp->rdma_mr)) { + ibv_dereg_mr(cbp->send_mr); + ibv_dereg_mr(cbp->recv_mr); + free(cbp->recv_buf); + free(cbp->send_buf); + free(cbp->rdma_buf); + cbp->recv_buf = NULL; + return errno; + } + + /* + * these never change + */ + cbp->rdma_sq_wr.wr_id = (uint64_t) (unsigned long) &cbp->rdma_sq_wr; + cbp->rdma_sq_wr.sg_list = &cbp->rdma_sgl; + cbp->rdma_sq_wr.num_sge = 1; + cbp->rdma_sgl.addr = (uint64_t) (unsigned long) cbp->rdma_buf; + cbp->rdma_sgl.lkey = cbp->rdma_mr->lkey; + + if (!cbp->server) { + cbp->start_buf = malloc(RPING_BUFSIZE); + if (cbp->start_buf == NULL) { + ibv_dereg_mr(cbp->send_mr); + ibv_dereg_mr(cbp->recv_mr); + ibv_dereg_mr(cbp->rdma_mr); + 
free(cbp->send_buf); + free(cbp->recv_buf); + free(cbp->rdma_buf); + cbp->recv_buf = NULL; + return ENOMEM; + } + + cbp->start_mr = ibv_reg_mr(cbp->pd, cbp->start_buf, + RPING_BUFSIZE, + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_WRITE); + if (!(cbp->start_mr)) { + ibv_dereg_mr(cbp->send_mr); + ibv_dereg_mr(cbp->recv_mr); + ibv_dereg_mr(cbp->rdma_mr); + free(cbp->send_buf); + free(cbp->recv_buf); + free(cbp->rdma_buf); + free(cbp->start_buf); + cbp->recv_buf = NULL; + return errno; + } + } + + rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr); + if (rc) { + ibv_dereg_mr(cbp->send_mr); + ibv_dereg_mr(cbp->recv_mr); + ibv_dereg_mr(cbp->rdma_mr); + free(cbp->recv_buf); + free(cbp->send_buf); + free(cbp->rdma_buf); + if (!cbp->server) { + free(cbp->start_buf); + ibv_dereg_mr(cbp->start_mr); + } + cbp->recv_buf = NULL; + return rc; + } + + DEBUG_LOG("allocated & registered buffers...\n"); + return 0; +} + +static int rping_create_qp(struct rping_cb *cbp) +{ + struct ibv_qp_init_attr init_attr; + struct ibv_qp_attr qp_attr; + int rc = 0; + + memset(&init_attr, 0, sizeof(init_attr)); + init_attr.cap.max_send_wr = RPING_SQ_DEPTH; + init_attr.cap.max_recv_wr = 2; + init_attr.cap.max_recv_sge = 1; + init_attr.cap.max_send_sge = 1; + init_attr.qp_type = IBV_QPT_RC; + init_attr.send_cq = cbp->cq; + init_attr.recv_cq = cbp->cq; + if (cbp->server) { + rc = rdma_create_qp(cbp->child_cm_id, cbp->pd, &init_attr); + cbp->qp = cbp->child_cm_id->qp; + } else { + rc = rdma_create_qp(cbp->cm_id, cbp->pd, &init_attr); + cbp->qp = cbp->cm_id->qp; + } + if (rc) { + cbp->qp = NULL; + return rc; + } + + /* set REMOTE access rights on QP */ + qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_READ| + IBV_ACCESS_REMOTE_WRITE; + rc = ibv_modify_qp(cbp->qp, &qp_attr, IBV_QP_ACCESS_FLAGS); + if (rc) + printf("ibv_modify_qp returned %d\n", rc); + return rc; +} + +static void *cm_thread(void *arg) +{ + int rc; + struct rdma_cm_event *event; + + while (1) { + rc = 
rdma_get_cm_event(&event); + if (rc) { + fprintf(stderr, "rdma_get_cm_event err %d\n", rc); + exit(rc); + } + rping_cma_event_handler(event->id, event); + rdma_ack_cm_event(event); + } +} + +static void *cq_thread(void *arg) +{ + struct rping_cb *cbp = arg; + int rc; + + DEBUG_LOG("cq_thread started.\n"); + + while (1) { + struct ibv_cq *ev_cq; + void *ev_ctx; + + rc = ibv_get_cq_event(cbp->channel, &ev_cq, &ev_ctx); + if (rc) { + fprintf(stderr, "Failed to get cq event!\n"); + exit(rc); + } + if (ev_cq != cbp->cq) { + fprintf(stderr, "Unkown CQ!\n"); + exit(-1); + } + rc = ibv_req_notify_cq(cbp->cq, 0); + if (rc) { + fprintf(stderr, "Failed to set notify!\n"); + exit(rc); + } + rping_cq_event_handler(cbp); + ibv_ack_cq_events(cbp->cq, 1); + } +} + +static void do_rping(struct rping_cb *cbp) +{ + int ping = 0; + int start; + int cc; + unsigned char c; + int rc; + + if (cbp->size == 0) + cbp->size = 64; + + /* + * Now doit! + */ + if (cbp->server) { + + /* + * Create listening endpoint. + */ + rc = rdma_listen(cbp->cm_id, 3); + if (rc) { + fprintf(stderr, "listen error %d\n", rc); + goto out; + } + + /* + * Wait for a connection request. 
+ */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, + "wait for CONNECT_REQUEST error %d state %d\n", + rc, cbp->state); + goto out; + } + + cbp->pd = ibv_alloc_pd(cbp->child_cm_id->verbs); + if (!(cbp->pd)) { + rc = errno; + goto out; + } + DEBUG_LOG("created pd %p\n", cbp->pd); + + cbp->channel = ibv_create_comp_channel(cbp->child_cm_id->verbs); + if (!cbp->channel) { + rc = errno; + goto out; + } + DEBUG_LOG("created channel %p\n", cbp->channel); + + cbp->cq = ibv_create_cq(cbp->child_cm_id->verbs, + RPING_SQ_DEPTH * 2, cbp, + cbp->channel, 0); + if (!(cbp->cq)) { + rc = errno; + goto out; + } + DEBUG_LOG("created cq %p\n", cbp->cq); + + rc = ibv_req_notify_cq(cbp->cq, 0); + if (rc) { + fprintf(stderr, "Failed to set notify!\n"); + rc = errno; + goto out; + } + + pthread_create(&cbp->cqthread, NULL, cq_thread, cbp); + + rc = rping_create_qp(cbp); + if (rc) { + goto out; + } + DEBUG_LOG("created qp %p\n", cbp->qp); + + /* + * Setup registered buffers. + */ + rc = rping_setup_buffers(cbp); + if (rc) { + goto out; + } + + /* + * Accept the connection request. + */ + rc = rping_accept_cr(cbp, "server", strlen("server") + 1); + if (rc) { + fprintf(stderr, "accept error %d\n", rc); + goto out; + } + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for CONNECTED " + "state error %d state %d\n", + rc, cbp->state); + goto out; + } + + /* + * Server side ping loop + */ + while (1) { + struct ibv_send_wr *bad_wr; + + /* + * Wait for client's Start STAG/TO/Len + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for RDMA_READ_ADV " + "state error %d state %d\n", + rc, cbp->state); + goto out; + } + + DEBUG_LOG("server received sink adv\n"); + + /* + * Issue RDMA Read. 
+ */ + cbp->rdma_sq_wr.opcode = IBV_WR_RDMA_READ; + cbp->rdma_sq_wr.wr_id = + (uint64_t) (unsigned long) &cbp->rdma_sq_wr; + cbp->rdma_sq_wr.sg_list = &cbp->rdma_sgl; + cbp->rdma_sq_wr.send_flags = IBV_SEND_SIGNALED; + cbp->rdma_sq_wr.wr.rdma.rkey = cbp->remote_rkey; + cbp->rdma_sq_wr.wr.rdma.remote_addr = cbp->remote_addr; + cbp->rdma_sq_wr.sg_list->length = cbp->remote_len; + + rc = ibv_post_send(cbp->qp, &cbp->rdma_sq_wr, &bad_wr); + if (rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + DEBUG_LOG("server posted rdma read req \n"); + + /* + * Wait for read completion + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for " + "RDMA_READ_COMPLETE state error %d " + "state %d\n", + rc, cbp->state); + goto out; + } + DEBUG_LOG("server received read complete\n"); + + /* + * Display data in recv buf + */ + if (cbp->verbose) { + printf("server ping data: %s\n", + cbp->rdma_buf); + } + + /* + * Tell client to continue + */ + cbp->send_sgl.length = 1; + rc = ibv_post_send(cbp->qp, &cbp->sq_wr, &bad_wr); + if (rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + DEBUG_LOG("server posted go ahead\n"); + + /* + * Wait for client's RDMA STAG/TO/Len + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for RDMA_WRITE_ADV " + "state error %d state %d\n", + rc, cbp->state); + goto out; + } + DEBUG_LOG("server received sink adv\n"); + + /* + * RDMA Write echo data + */ + cbp->rdma_sq_wr.opcode = IBV_WR_RDMA_WRITE; + cbp->rdma_sq_wr.wr.rdma.rkey = cbp->remote_rkey; + cbp->rdma_sq_wr.wr.rdma.remote_addr = cbp->remote_addr; + cbp->rdma_sq_wr.send_flags = IBV_SEND_SIGNALED; + cbp->rdma_sq_wr.sg_list->length = + strlen(cbp->rdma_buf) + 1; + DEBUG_LOG( + "rdma write from lkey %x laddr %llx len %d\n", + cbp->rdma_sq_wr.sg_list->lkey, + cbp->rdma_sq_wr.sg_list->addr, + cbp->rdma_sq_wr.sg_list->length); + rc = ibv_post_send(cbp->qp, &cbp->rdma_sq_wr, &bad_wr); + if 
(rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + + /* + * Wait for completion + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "waiting for " + "RDMA_WRITE_COMPLETE state error %d\n", + rc); + goto out; + } + DEBUG_LOG("server rdma write complete \n"); + + /* + * Tell client to begin again + */ + cbp->send_sgl.length = 1; + rc = ibv_post_send(cbp->qp, &cbp->sq_wr, &bad_wr); + if (rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + DEBUG_LOG("server posted go ahead\n"); + } + } else { + cbp->pd = ibv_alloc_pd(cbp->cm_id->verbs); + if (!(cbp->pd)) { + rc = errno; + goto out; + } + DEBUG_LOG("created pd %p\n", cbp->pd); + + cbp->channel = ibv_create_comp_channel(cbp->cm_id->verbs); + if (!cbp->channel) { + rc = errno; + goto out; + } + DEBUG_LOG("created channel %p\n", cbp->channel); + + cbp->cq = ibv_create_cq(cbp->cm_id->verbs, + RPING_SQ_DEPTH * 2, cbp, cbp->channel, + 0); + if (!(cbp->cq)) { + rc = errno; + goto out; + } + DEBUG_LOG("created cq %p\n", cbp->cq); + + rc = ibv_req_notify_cq(cbp->cq, 0); + if (rc) { + fprintf(stderr, "Failed to set notify!\n"); + rc = errno; + goto out; + } + + pthread_create(&cbp->cqthread, NULL, cq_thread, cbp); + + rc = rping_create_qp(cbp); + if (rc) { + goto out; + } + DEBUG_LOG("created qp %p\n", cbp->qp); + + /* + * Setup registered buffers. + */ + rc = rping_setup_buffers(cbp); + if (rc) + goto out; + + /* + * Connect to server. + */ + rc = rping_connect(cbp, "client", strlen("client") + 1); + if (rc) { + fprintf(stderr, "connect error %d\n", rc); + goto out; + } + + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, + "wait for CONNECTED error %d state %d\n", rc, + cbp->state); + goto out; + } + + /* + * Client side ping loop. 
+ */ + start = 65; + while (1) { + int i; + struct ibv_send_wr *bad_wr; + + cbp->state = RDMA_READ_ADV; + + ++ping; + if (cbp->count && (ping > cbp->count)) { + goto out; + } + + /* + * Put some ascii text in the buffer. + */ + cc = sprintf(cbp->start_buf, "rdma-ping-%d: ", ping); + for (i = cc, c = start; i < cbp->size; i++) { + cbp->start_buf[i] = c; + c++; + if (c > 122) + c = 65; + } + start++; + if (start > 122) + start = 65; + cbp->start_buf[cbp->size] = 0; + + /* + * Send our start buffer rkey/addr/len... + * The server will use this to RDMA READ the ping. + */ + DEBUG_LOG("Sending Start rkey %x " + "addr %llx len %d for RDMA READ Source\n", + cbp->start_mr->rkey, + (uint64_t) (unsigned long) cbp->start_buf, + cbp->size + 1); + cbp->send_sgl.length = 16; + *((uint32_t *) (cbp->send_buf)) = cbp->start_mr->rkey; + *((uint64_t *) (cbp->send_buf + 4)) = + (uint64_t) (unsigned long) cbp->start_buf; + *((uint32_t *) (cbp->send_buf + 12)) = cbp->size + 1; + rc = ibv_post_send(cbp->qp, &cbp->sq_wr, &bad_wr); + if (rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + + /* + * Wait for server to ACK + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for RDMA_WRITE_ADV " + "state error %d state %d\n", + rc, cbp->state); + goto out; + } + + /* + * Send our rdma buffer rkey/addr/len for receiving + * the ping echo from the server via RDMA_WRITE... 
+ */ + DEBUG_LOG("Sending rkey %x addr %llx " + "len %d for RDMA WRITE Sink\n", + cbp->rdma_mr->rkey, + (uint64_t) (unsigned long) cbp->rdma_buf, + cbp->size + 1); + cbp->send_sgl.length = 16; + *((uint32_t *) (cbp->send_buf)) = cbp->rdma_mr->rkey; + *((uint64_t *) (cbp->send_buf + 4)) = + (uint64_t) (unsigned long) cbp->rdma_buf; + *((uint32_t *) (cbp->send_buf + 12)) = cbp->size + 1; + rc = ibv_post_send(cbp->qp, &cbp->sq_wr, &bad_wr); + if (rc) { + fprintf(stderr, "post send error %d\n", rc); + goto out; + } + + /* + * Wait for the server to say the RDMA Write is + * complete. + */ + rc = sem_wait(&cbp->sem); + if (rc || (cbp->state == ERROR)) { + fprintf(stderr, "wait for " + "RDMA_WRITE_COMPLETE state error %d " + "state %d\n", rc, cbp->state); + goto out; + } + + if (cbp->validate) { + + /* + * Validate data + */ + if (memcmp (cbp->start_buf, cbp->rdma_buf, + cbp->size + 1)) { + fprintf(stderr, "data mismatch!\n"); + goto out; + } + } + + /* + * Display ping data. + */ + if (cbp->verbose) { + printf("ping data: %s\n", cbp->rdma_buf); + } + } + } +out: + DEBUG_LOG("disconnecting\n"); + rdma_disconnect(cbp->cm_id); + + /* cleanup */ + if (cbp->child_cm_id) { + DEBUG_LOG("destroying child cm_id %p\n", cbp->child_cm_id); + rdma_destroy_id(cbp->child_cm_id); + } + if (cbp->qp) { + DEBUG_LOG("destroying qp %p\n", cbp->qp); + ibv_destroy_qp(cbp->qp); + } + if (cbp->cq) { + DEBUG_LOG("destroying cq %p\n", cbp->cq); + ibv_destroy_cq(cbp->cq); + } + if (cbp->recv_buf) { + DEBUG_LOG("freeing bufs/mrs\n"); + rping_free_buffers(cbp); + } + if (cbp->pd) { + DEBUG_LOG("dealloc pd %p\n", cbp->pd); + ibv_dealloc_pd(cbp->pd); + } + printf("close complete - returning from test \n"); + return; +} + +static void usage(char *name) +{ + printf("%s -c|s [-vVd] [-S size] [-C count] -a addr -p port\n", + basename(name)); + printf("\t-c\t\tclient side\n"); + printf("\t-s\t\tserver side\n"); + printf("\t-v\t\tdisplay ping data to stdout\n"); + printf("\t-V\t\tverbosity\n"); + 
printf("\t-d\t\tdebug printfs\n"); + printf("\t-S size \tping data size\n"); + printf("\t-C count\tping count times\n"); + printf("\t-a addr\t\taddress\n"); + printf("\t-p port\t\tport\n"); +} + +/* + * This function parses the command and executes the appropriate + * rping test. It is assumed this entire function + * can execute on the calling thread and sleep if needed. + */ +int main(int argc, char *argv[]) +{ + struct rping_cb *cbp; + int op; + int rc = 0; + struct sockaddr_in sin; + + cbp = malloc(sizeof(*cbp)); + if (cbp == NULL) { + return ENOMEM; + } + memset(cbp, 0, sizeof(*cbp)); + cbp->server = -1; + cbp->state = IDLE; + sem_init(&cbp->sem, 0, 0); + + opterr = 0; + while ((op=getopt(argc, argv, "a:p:C:S:t:scvVd")) != -1) { + switch (op) { + case 'a': + cbp->addr_str = optarg; + cbp->addr = inet_addr(optarg); + DEBUG_LOG("ipaddr (%s)\n", optarg); + break; + case 'p': + cbp->port = htons(atoi(optarg)); + DEBUG_LOG("port %d\n", (int) atoi(optarg)); + break; + case 's': + cbp->server = 1; + DEBUG_LOG("server\n"); + break; + case 'c': + cbp->server = 0; + DEBUG_LOG("client\n"); + break; + case 'S': + cbp->size = atoi(optarg) - 1; + if ((cbp->size < 1) + || (cbp->size > (RPING_BUFSIZE - 1))) { + fprintf(stderr, "Invalid size %d " + "(valid range is 1 to %d)\n", + cbp->size, RPING_BUFSIZE); + rc = EINVAL; + } else + DEBUG_LOG("size %d\n", + (int) atoi(optarg)); + break; + case 'C': + cbp->count = atoi(optarg); + if (cbp->count < 0) { + fprintf(stderr, "Invalid count %d\n", + cbp->count); + rc = EINVAL; + } else + DEBUG_LOG("count %d\n", + (int) cbp->count); + break; + case 'v': + cbp->verbose++; + DEBUG_LOG("verbose\n"); + break; + case 'V': + cbp->validate++; + DEBUG_LOG("validate data\n"); + break; + case 'd': + debug++; + break; + default: + usage("rping"); + rc = EINVAL; + break; + } + } + if (rc) + goto out; + + if (cbp->server == -1) { + fprintf(stderr, "must be either client or server\n"); + rc = EINVAL; + goto out; + } + + rc = 
rdma_create_id(&cbp->cm_id, cbp); + if (rc) { + rc = errno; + cbp->cm_id = NULL; + fprintf(stderr, "rdma_create_id error %d\n", rc); + goto out; + } + DEBUG_LOG("created cm_id %p\n", cbp->cm_id); + + pthread_create(&cbp->cmthread, NULL, cm_thread, cbp); + + /* + * Server binds to local addr/port to find the device. Client resolves + * the remote addr/port to find the device. + */ + if (cbp->server) { + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = cbp->addr; + sin.sin_port = cbp->port; + rc = rdma_bind_addr(cbp->cm_id, (struct sockaddr *) &sin); + if (rc) { + fprintf(stderr, "rdma_bind_addr error %d\n", rc); + goto out; + } + DEBUG_LOG("rdma_bind_addr worked\n"); + } else { + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = cbp->addr; + sin.sin_port = cbp->port; + rc = rdma_resolve_addr(cbp->cm_id, NULL, + (struct sockaddr *) &sin, 2000); + if (rc) { + fprintf(stderr, "rdma_resolve_addr error %d\n", rc); + goto out; + } + + rc = sem_wait(&cbp->sem); + if (rc || cbp->state == ERROR) { + fprintf(stderr, "waiting for address resolution " + "error %d state %d\n", rc, cbp->state); + goto out; + } + DEBUG_LOG("rdma_resolve_addr worked\n"); + } + + do_rping(cbp); + +out: + if (cbp->cm_id) { + DEBUG_LOG("destroy cm_id %p\n", cbp->cm_id); + rdma_destroy_id(cbp->cm_id); + } + free(cbp); + return rc; +} From kschoche at scl.ameslab.gov Wed Feb 8 08:35:49 2006 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Wed, 08 Feb 2006 10:35:49 -0600 Subject: [openib-general] problem with user-verb WC's Message-ID: <43EA1DE5.6030507@scl.ameslab.gov> While working on the openIB port for PVFS2, I've stumbled across some problems in posting rdma requests via the user-verbs interface with ib_mthca drivers. 
According to a 'TODO' buried in the gen2 src/linux-kernel/infiniband/hw/ : "MW support: ib_mthca does not support memory windows" The opcodes that I receive for non-rdma requests are all correct, however, when posting rdma requests, I'm consistently getting work completions with opcodes of: IBV_WC_BIND_MW I'm not making any (known) calls or requests to bind to a memory window, or for that matter to create a memory window. So how does a completion event get generated with an opcode indicating a currently unimplemented feature has just finished? And are there other reasons why I should/would be getting this type of completion? Thanks, Kyle -- Kyle Schochenmaier kschoche at scl.ameslab.gov Research Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory From sean.hefty at intel.com Wed Feb 8 08:40:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 08:40:49 -0800 Subject: [openib-general] trying to run cmpost example In-Reply-To: <1139402367.8528.14.camel@ipnnarval> Message-ID: >Does somebody have an idea of what is missing ? I'm not sure what's the cause of the error that you're seeing. However, at one point ucmpost required ib_at to run, which was never added to the kernel. It has since been switched over to use rdma_cm, which was not submitted until last week. You won't be able to run ucmpost against the standard 2.6.15 kernel. >All my lib code comes from the svn repository, do I need to modify the >2.6.15 infiniband directory ? I'm traveling today, so I won't be able to look into this more until later this week. It is likely that you'll need to update to the latest Infiniband code in svn, to match the userspace code that you're using. This doesn't address the issue, but is there a particular reason why you want to test the userspace CM? If you can use a later svn version, have you considered using the userspace rdma_cm? - Sean From mst at mellanox.co.il Wed Feb 8 08:45:45 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 8 Feb 2006 18:45:45 +0200 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139416042.26808.14.camel@stevo-desktop> References: <1139416042.26808.14.camel@stevo-desktop> Message-ID: <20060208164545.GV28594@mellanox.co.il> Quoting r. Steve Wise : > Subject: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA > > All, > > Attached is a user-mode program, called rping, that uses librdmacm and > libibverbs to implement a ping-pong program over an RC connection. The > program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq > channels to get cq events, and rdma_get_event() to detect CMA events. > It is multi-threaded. > > I've built it as an example program in librdmacm/examples and tested it > with mthca. It is useful to test CMA as well as all the major rdma > operations in a transport-neutral way. > > If you all find it has utility, please pull it into librdmacm/examples. > > > Signed-off-by: Steve Wise Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sean.hefty at intel.com Wed Feb 8 08:47:50 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 08:47:50 -0800 Subject: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139416042.26808.14.camel@stevo-desktop> Message-ID: >Attached is a user-mode program, called rping, that uses librdmacm and >libibverbs to implement a ping-pong program over an RC connection. The >program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq >channels to get cq events, and rdma_get_event() to detect CMA events. >It is multi-threaded. > >I've built it as an example program in librdmacm/examples and tested it >with mthca. 
It is useful to test CMA as well as all the major rdma >operations in a transport-neutral way. > >If you all find it has utility, please pull it into librdmacm/examples. Thanks. I may not get a chance to test this for a couple of days, but some additional tests for librdmacm would definitely be useful. - Sean From swise at opengridcomputing.com Wed Feb 8 08:49:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 10:49:37 -0600 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <20060208164545.GV28594@mellanox.co.il> References: <1139416042.26808.14.camel@stevo-desktop> <20060208164545.GV28594@mellanox.co.il> Message-ID: <1139417377.26808.21.camel@stevo-desktop> On Wed, 2006-02-08 at 18:45 +0200, Michael S. Tsirkin wrote: > Quoting r. Steve Wise : > > Subject: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA > > > > All, > > > > Attached is a user-mode program, called rping, that uses librdmacm and > > libibverbs to implement a ping-pong program over an RC connection. The > > program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq > > channels to get cq events, and rdma_get_event() to detect CMA events. > > It is multi-threaded. > > > > I've built it as an example program in librdmacm/examples and tested it > > with mthca. It is useful to test CMA as well as all the major rdma > > operations in a transport-neutral way. > > > > If you all find it has utility, please pull it into librdmacm/examples. > > > > > > Signed-off-by: Steve Wise > > Steve, looks like you have at most a single receive work request posted at the > receive workqueue at all times. > If true, this is *really* not a good idea, performance-wise, even if you > actually have at most 1 packet in flight. Hey Michael, There is at most only one SEND in flight. This is a test program, not a performance program. 
Its goal is to utilize SEND, RECV, RDMA READ, and RDMA WRITE as well as CMA to setup the connection... Thanks, Steve. From sean.hefty at intel.com Wed Feb 8 08:50:11 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 08:50:11 -0800 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <20060208164545.GV28594@mellanox.co.il> Message-ID: >Steve, looks like you have at most a single receive work request posted at the >receive workqueue at all times. >If true, this is *really* not a good idea, performance-wise, even if you >actually have at most 1 packet in flight. Can you provide some more details on this? - Sean From swise at opengridcomputing.com Wed Feb 8 08:52:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 10:52:12 -0600 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139417377.26808.21.camel@stevo-desktop> References: <1139416042.26808.14.camel@stevo-desktop> <20060208164545.GV28594@mellanox.co.il> <1139417377.26808.21.camel@stevo-desktop> Message-ID: <1139417532.26808.24.camel@stevo-desktop> > Hey Michael, > > There is at most only one SEND in flight. This is a test program, not a > performance program. Its goal is to utilize SEND, RECV, RDMA READ, and > RDMA WRITE as well as CMA to setup the connection... > > Thanks, > > Steve. By the way, in case its not clear: The SEND/RECV exchanges are done just to advertise source and sink memory regions, and to indicate completion of rdma read and write operations to the peer. The "ping/pong" data is transferred with rdma read and write operations. Thanks for the feedback! 
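The advertisement described above can be sketched on its own. In the posted patch the client fills a 16-byte send buffer with a 32-bit rkey at offset 0, a 64-bit buffer address at offset 4, and a 32-bit length at offset 12, all in host byte order. The following is a minimal sketch assuming that same layout; the names `rping_adv`, `adv_pack`, and `adv_unpack` are hypothetical and do not appear in the patch, and `memcpy` is used instead of the patch's pointer casts only to sidestep unaligned stores.

```c
#include <stdint.h>
#include <string.h>

#define ADV_LEN 16	/* matches send_sgl.length = 16 in the patch */

/* Hypothetical container for one advertised memory region. */
struct rping_adv {
	uint32_t rkey;	/* remote key of the registered region */
	uint64_t addr;	/* start address of the region */
	uint32_t len;	/* number of bytes the peer may read or write */
};

/* Lay the three fields out at offsets 0, 4 and 12 (host byte order). */
static void adv_pack(unsigned char buf[ADV_LEN], const struct rping_adv *adv)
{
	memcpy(buf, &adv->rkey, sizeof adv->rkey);
	memcpy(buf + 4, &adv->addr, sizeof adv->addr);
	memcpy(buf + 12, &adv->len, sizeof adv->len);
}

/* Recover the fields from a received 16-byte advertisement. */
static void adv_unpack(const unsigned char buf[ADV_LEN], struct rping_adv *adv)
{
	memcpy(&adv->rkey, buf, sizeof adv->rkey);
	memcpy(&adv->addr, buf + 4, sizeof adv->addr);
	memcpy(&adv->len, buf + 12, sizeof adv->len);
}
```

Because nothing is byte-swapped, a layout like this only works between peers of the same endianness; a program meant to interoperate across architectures would probably convert the fields to network order first.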
From xma at us.ibm.com Wed Feb 8 08:55:14 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 8 Feb 2006 08:55:14 -0800 Subject: [openib-general] Ifdown/ifup pick up the wrong ib interface configuration file In-Reply-To: <1139360899.14390.54.camel@simba.pantasys.com> Message-ID: Check your ifcfg-ib0/ifcfg-ib1 script to see whether the interface name matches. Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From mdidomenico at gmail.com Wed Feb 8 08:58:33 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 11:58:33 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208152342.GS28594@mellanox.co.il> References: <97a7c7ed0602080600s373d5c04ie6733715284bf23e@mail.gmail.com> <20060208152342.GS28594@mellanox.co.il> Message-ID: <97a7c7ed0602080858k59651c3dndd194d57fd4ce8aa@mail.gmail.com> On 2/8/06, Michael S. Tsirkin wrote: > Quoting r. Michael Di Domenico : > > Subject: Re: openib and mellanox hca problem > > > > Roland, > > > > I've attached the dmesg and lspci outputs... > > You really want lspci *before* mthca got loaded. > This one just shows the card's incommunicado. I'm going to try and rollback to RedHat EL4 IA32 and see if i can get the machines up and using the silverstorm host stack and make everything works fine. unforgunately we dont have a stack for fedora core 4 on ia32 on ia64 afterwards i'll load up the openib stack and see what happens... thanks for the help From mst at mellanox.co.il Wed Feb 8 09:10:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 19:10:23 +0200 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: References: Message-ID: <20060208171023.GX28594@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA > > >Steve, looks like you have at most a single receive work request posted at the > >receive workqueue at all times.
> >If true, this is *really* not a good idea, performance-wise, even if you > >actually have at most 1 packet in flight. > > Can you provide some more details on this? See 9.7.7.2 end-to-end (message level) flow control -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sean.hefty at intel.com Wed Feb 8 09:13:23 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 09:13:23 -0800 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CCDF5@mtlexch01.mtl.com> Message-ID: >For example, a 1000 host cluster, with 2 ports per HCA will have at >least 4000 records in a SubnAdmGetTableResp for all PortInfo records on >the network (2000 for HCAs, and at least 2000 for the switch ports). >Such a query response will generate an RMPP of size 256K -- 1000 >segments, or a 4K buffer on an X86 machine just for the array (assuming >one allocation per RMPP segment -- N=1). I think that this is a good reason to use an array. Walking a 1000 entry list 1000 times is a substantial performance hit. Lost MADs and retries will make this worse. A 4K buffer for the array is less than the 8K total needed for the 1000 list items. We're already talking about allocating over 256K of memory just for the data payload. An additional contiguous 4k buffer seems like a minor issue. I'm not convinced that there's a real issue here. To support ridiculously large transfers from userspace, we may need to push the RMPP handling up into userspace. 
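The sizes quoted above can be checked with a little arithmetic. This is a rough sketch, not code from the MAD layer: it assumes the "256K" payload is 256,000 bytes, that each RMPP segment carries 256 bytes, that pointers are 4 bytes (32-bit x86), and that a list node costs two pointers. Every name in it is hypothetical. It also counts the link traversals needed if each segment lookup restarts from the head of a list, which is one plausible reading of the repeated list walk mentioned above.

```c
#include <stddef.h>

enum {
	PAYLOAD_BYTES = 256 * 1000,	/* assumed size of the query response */
	SEGMENT_BYTES = 256,		/* assumed bytes carried per segment */
	PTR_BYTES = 4			/* assumed pointer size on 32-bit x86 */
};

/* Number of segments needed to carry the whole payload. */
static size_t segment_count(void)
{
	return PAYLOAD_BYTES / SEGMENT_BYTES;
}

/* One pointer per segment when segments are tracked in an array. */
static size_t array_bytes(size_t nsegs)
{
	return nsegs * PTR_BYTES;
}

/* Two pointers (next and prev) per segment when tracked in a list. */
static size_t list_bytes(size_t nsegs)
{
	return nsegs * 2 * PTR_BYTES;
}

/* Links followed if finding segment i always starts at the list head. */
static size_t list_walk_steps(size_t nsegs)
{
	size_t steps = 0;
	size_t i;

	for (i = 0; i < nsegs; i++)
		steps += i;
	return steps;
}
```

Under these assumptions the array costs roughly 4K and the list links roughly 8K for 1000 segments, and a head-restarting list needs on the order of half a million link traversals where the array needs 1000 index operations.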
- Sean From grave at ipno.in2p3.fr Wed Feb 8 09:15:46 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Wed, 08 Feb 2006 18:15:46 +0100 Subject: [openib-general] error when using libsdp Message-ID: <1139418946.8532.37.camel@ipnnarval> Hi, I have compiled and configured libsdp and when I start my application I get this message : default libsdp configuration is used Error 97 calling socket for SDP socket errno 97 gives #define EAFNOSUPPORT 97 /* Address family not supported by protocol */ How can I enable the SDP support ? Thanks in advance, xavier From rdreier at cisco.com Wed Feb 8 09:18:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 09:18:45 -0800 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208104043.GF28594@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 8 Feb 2006 12:40:43 +0200") References: <97a7c7ed0602071559w2f22a127vacd954970d063f8f@mail.gmail.com> <20060208104043.GF28594@mellanox.co.il> Message-ID: Michael> I wander whether we manage to locate the bridge. It Michael> would be interesting to build mthca with debug enabled. Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a good idea. But even without debug, if we don't find a bridge, we should see the warning from the code: if (!bridge) { /* * Didn't find a bridge for a Tavor device -- * assume we're in no-bridge mode and hope for * the best. */ mthca_warn(mdev, "No bridge found for %s\n", pci_name(mdev->pdev)); } - R. From mst at mellanox.co.il Wed Feb 8 09:20:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 19:20:53 +0200 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <1139417532.26808.24.camel@stevo-desktop> References: <1139417532.26808.24.camel@stevo-desktop> Message-ID: <20060208172053.GZ28594@mellanox.co.il> Quoting r. 
Steve Wise : > Subject: Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA > > > Hey Michael, > > > > There is at most only one SEND in flight. This is a test program, not a > > performance program. Its goal is to utilize SEND, RECV, RDMA READ, and > > RDMA WRITE as well as CMA to setup the connection... > > > > Thanks, > > > > Steve. > > By the way, in case its not clear: The SEND/RECV exchanges are done > just to advertise source and sink memory regions, and to indicate > completion of rdma read and write operations to the peer. The > "ping/pong" data is transferred with rdma read and write operations. > > Thanks for the feedback! > Code tends to get copied around ... it's easy to imagine someone copying this and measuring the send latency. Just posting many WRs in the initialization sequence, with no other code changes, will fix this problem. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 09:21:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 19:21:31 +0200 Subject: [openib-general] Re: error when using libsdp In-Reply-To: <1139418946.8532.37.camel@ipnnarval> References: <1139418946.8532.37.camel@ipnnarval> Message-ID: <20060208172131.GA28594@mellanox.co.il> Quoting r. Xavier Grave : > Subject: error when using libsdp > > Hi, > > I have compiled and configured libsdp and when I start my application I > get this message : > default libsdp configuration is used > Error 97 calling socket for SDP socket > errno 97 gives > #define EAFNOSUPPORT 97 /* Address family not supported by > protocol */ > How can I enable the SDP support ? > > Thanks in advance, xavier > Did you load the ib_sdp module? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 09:26:41 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 8 Feb 2006 19:26:41 +0200 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: References: Message-ID: <20060208172641.GC28594@mellanox.co.il> Quoting Roland Dreier : > Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a > good idea. But even without debug, if we don't find a bridge, we > should see the warning from the code: Right, I wanted to check we got the right bus/device number, and it seems we did. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies
From swise at opengridcomputing.com Wed Feb 8 09:41:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 11:41:30 -0600 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <20060208171023.GX28594@mellanox.co.il> References: <20060208171023.GX28594@mellanox.co.il> Message-ID: <1139420490.26808.32.camel@stevo-desktop> On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA > > > > >Steve, looks like you have at most a single receive work request posted at the > > >receive workqueue at all times. > > >If true, this is *really* not a good idea, performance-wise, even if you > > >actually have at most 1 packet in flight. > > > > Can you provide some more details on this? > > See 9.7.7.2 end-to-end (message level) flow control > I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is? 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop).
I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed than post some number of RECVs "just to be safe". Sorry if I'm being dense. What am I missing here? Steve. From swise at opengridcomputing.com Wed Feb 8 09:46:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 11:46:29 -0600 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <20060208172053.GZ28594@mellanox.co.il> References: <1139417532.26808.24.camel@stevo-desktop> <20060208172053.GZ28594@mellanox.co.il> Message-ID: <1139420789.26808.38.camel@stevo-desktop> > > By the way, in case it's not clear: The SEND/RECV exchanges are done > > just to advertise source and sink memory regions, and to indicate > > completion of rdma read and write operations to the peer. The > > "ping/pong" data is transferred with rdma read and write operations. > > > > Thanks for the feedback! > > > > Code tends to get copied around ... it's easy to imagine someone > copying this and measuring the send latency. Just posting many WRs > in the initialization sequence, with no other code changes, > will fix this problem. > Each "ping/pong" iteration with rping is composed of 2 sends on the client side, 2 sends on the server side, plus 1 rdma read and 1 rdma write on the server side. Again, latency performance (or any performance) isn't a goal of this program. Testing CMA, CQ and CMA event notifications, and send/recv/rr/rw are the goals. Snippet from the patch: +/* + * rping "ping/pong" loop: + * client sends source rkey/addr/len + * server receives source rkey/addr/len + * server rdma reads "ping" data from source + * server sends "go ahead" on rdma read completion + * client sends sink rkey/addr/len + * server receives sink rkey/addr/len + * server rdma writes "pong" data to sink + * server sends "go ahead" on rdma write completion + * + */ Can you be more specific on what you think I should change?
Are you suggesting I post more RECVs? Steve. From sean.hubbell at dbresearch.net Wed Feb 8 09:46:54 2006 From: sean.hubbell at dbresearch.net (Sean Hubbell) Date: Wed, 08 Feb 2006 11:46:54 -0600 Subject: [openib-general] ibstat problem In-Reply-To: <1139328093.26395.18.camel@stevo-desktop> References: <1139328093.26395.18.camel@stevo-desktop> Message-ID: <43EA2E8E.4040102@dbresearch.net> Yes, We discovered this yesterday. You built the libraries and did not build the diag. tools. Once you do this, things work. I do still have a few problems on sending messages out multicast though. Sean Steve Wise wrote: >Anyone see this before? > >----- > >vic17:~ # ibstat >ibstat: relocation error: ibstat: symbol argv0, version IBCOMMON_1.0 not >defined in file libibcommon.so.1 with link time reference >vic17:~ # uname -a >Linux vic17 2.6.15.2-kdb #4 SMP PREEMPT Mon Feb 6 17:24:41 CST 2006 i686 >i686 i386 GNU/Linux >vic17:~ # > > >----- > >[swise at dell3 src]$ svn info >Path: . >URL: https://openib.org/svn/gen2/trunk/src >Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd >Revision: 5330 >Node Kind: directory >Schedule: normal >Last Changed Author: ogerlitz >Last Changed Rev: 5330 >Last Changed Date: 2006-02-07 07:23:38 -0600 (Tue, 07 Feb 2006) > >[swise at dell3 src]$ > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From caitlinb at broadcom.com Wed Feb 8 09:51:40 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 8 Feb 2006 09:51:40 -0800 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DB4C@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >>> By the way, in case its not clear: The SEND/RECV exchanges are done >>> just to advertise 
source and sink memory regions, and to indicate >>> completion of rdma read and write operations to the peer. The >>> "ping/pong" data is transferred with rdma read and write operations. >>> >>> Thanks for the feedback! >>> >> >> Code tends to get copied around ... its easy to imagine someone >> copying this and measuring the send latency. Just posting many WRs in >> the initialization sequence, with no other code changes, will fix >> this problem. >> > > Each "ping/pong" iteration with rping is composed of 2 sends > on the client side, 2 sends on the server side, plus 1 rdma > read and 1 rdma write on the server side. > > Again, latency performance (or any performance) isn't a goal > of this program. Testing CMA, CQ and CMA event > notifications, and send/recv/rr/rw are the goals. > > > snipit from the patch: > > +/* > + * rping "ping/pong" loop: > + * client sends source rkey/addr/len > + * server receives source rkey/add/len > + * server rdma reads "ping" data from source > + * server sends "go ahead" on rdma read completion > + * client sends sink rkey/addr/len > + * server receives sink rkey/addr/len > + * server rdma writes "pong" data to sink > + * server sends "go ahead" on rdma write completion + * > + */ > Why does the server send "go ahead" after rdma write completion? It should be able to just post the send after posting the rdma write without waiting. When the rdma write completes has no device/transport independent meaning. From mdidomenico at gmail.com Wed Feb 8 10:11:50 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Wed, 8 Feb 2006 13:11:50 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060208172641.GC28594@mellanox.co.il> References: <20060208172641.GC28594@mellanox.co.il> Message-ID: <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> On 2/8/06, Michael S. Tsirkin wrote: > Quoting Roland Dreier : > > Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a > > good idea. 
But even without debug, if we don't find a bridge, we > > should see the warning from the code: > > Right, I wanded to check we got the right bus/device number, and it seems > we did. FYI... Changed over to RHEL4 IA32 w/ SilverStorm Host Stack v3.2.0.0.21 and now i get the below info and a working infiniband setup... Since I have two servers, I'm going to leave this one working and try openib on the second machine... # uname -a Linux linux14.silverstorm.com 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux # lspci -vvv 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode]) Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [70] PCI-X bridge device. Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=3 Status: Bus=6 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD- : Upstream: Capacity=512, Commitment Limit=512 : Downstream: Capacity=128, Commitment Limit=128 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technologies MT23108 InfiniHost Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- (Kyle Schochenmaier's message of "Wed, 08 Feb 2006 10:35:49 -0600") References: <43EA1DE5.6030507@scl.ameslab.gov> Message-ID: Kyle> According to a 'TODO' buried in the gen2 Kyle> src/linux-kernel/infiniband/hw/ : "MW support: ib_mthca does Kyle> not support memory windows" Yes, it's true that mthca doesn't support memory windows yet. 
Kyle> The opcodes that I receive for non-rdma requests are all Kyle> correct, however, when posting rdma requests, I'm Kyle> consistently getting work completions with opcodes of: Kyle> IBV_WC_BIND_MW Kyle> I'm not making any (known) calls or requests to bind to a Kyle> memory window, or for that matter to create a memory window. Kyle> So how does a completion event get generated with an opcode Kyle> indicating a currently unimplemented feature has just Kyle> finished? And are there other reasons why I should/would be Kyle> getting this type of completion? Looking at the code, I don't see anything obviously wrong. What is the status field of the completions that you see with the opcode IBV_WC_BIND_MW? If the status is not success, then the opcode field is undefined (only the wrid is valid). - R. From swise at opengridcomputing.com Wed Feb 8 10:53:30 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 12:53:30 -0600 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122DB4C@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F122DB4C@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1139424810.26808.40.camel@stevo-desktop> On Wed, 2006-02-08 at 09:51 -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > >>> By the way, in case its not clear: The SEND/RECV exchanges are done > >>> just to advertise source and sink memory regions, and to indicate > >>> completion of rdma read and write operations to the peer. The > >>> "ping/pong" data is transferred with rdma read and write operations. > >>> > >>> Thanks for the feedback! > >>> > >> > >> Code tends to get copied around ... its easy to imagine someone > >> copying this and measuring the send latency. Just posting many WRs in > >> the initialization sequence, with no other code changes, will fix > >> this problem. 
> >> > > > > Each "ping/pong" iteration with rping is composed of 2 sends > > on the client side, 2 sends on the server side, plus 1 rdma > > read and 1 rdma write on the server side. > > > > Again, latency performance (or any performance) isn't a goal > > of this program. Testing CMA, CQ and CMA event > > notifications, and send/recv/rr/rw are the goals. > > > > > > snipit from the patch: > > > > +/* > > + * rping "ping/pong" loop: > > + * client sends source rkey/addr/len > > + * server receives source rkey/add/len > > + * server rdma reads "ping" data from source > > + * server sends "go ahead" on rdma read completion > > + * client sends sink rkey/addr/len > > + * server receives sink rkey/addr/len > > + * server rdma writes "pong" data to sink > > + * server sends "go ahead" on rdma write completion + * > > + */ > > > > Why does the server send "go ahead" after rdma write completion? No particular reason. > It should be able to just post the send after posting the rdma > write without waiting. When the rdma write completes has no > device/transport independent meaning. You're correct. It does not need to wait for the rdma write completion... From mst at mellanox.co.il Wed Feb 8 11:04:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 21:04:14 +0200 Subject: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA In-Reply-To: <1139420490.26808.32.camel@stevo-desktop> References: <1139420490.26808.32.camel@stevo-desktop> Message-ID: <20060208190414.GC1697@mellanox.co.il> Quoting r. Steve Wise : > Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA > > On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote: > > Quoting r. 
Sean Hefty : > > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA > > > > > > >Steve, looks like you have at most a single receive work request posted at the > > > >receive workqueue at all times. > > > >If true, this is *really* not a good idea, performance-wise, even if you > > > >actually have at most 1 packet in flight. > > > > > > Can you provide some more details on this? > > > > See 9.7.7.2 end-to-end (message level) flow control > > > > I just read this section in the 1.2 version of the spec, and I still > don't understand what the issue really is? 9.7.7.2 talks about IBA > doing flow control based on the RECV WQEs posted. rping always ensures > that there is a RECV posted before the peer can send. This is ensured > by the rping protocol itself (see the comment at the front of rping.c > describing the ping loop). > > I'm only ever sending one outstanding message via SEND/RECV. I would > rather post exactly what is needed, than post some number of RECVs "just > to be safe". Sorry if I'm being dense. What am I missing here? > > Steve. > As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed). As a result the remote side will "think" that there are no receive WQEs and will slow down (what the spec refers to as limited WQE). -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 11:06:32 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 8 Feb 2006 21:06:32 +0200 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogramusing CMA In-Reply-To: <1139420789.26808.38.camel@stevo-desktop> References: <1139420789.26808.38.camel@stevo-desktop> Message-ID: <20060208190632.GD1697@mellanox.co.il> Quoting Steve Wise : > Can you be more specific on what you think I should change? Are you > suggesting I post more RECVs? During the initialization stage, post the same receive WR multiple times (according to the RQ size). Nothing needs to be touched in the loop: when you get a CQE, post just one receive WR. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 11:11:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 21:11:46 +0200 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139416042.26808.14.camel@stevo-desktop> References: <1139416042.26808.14.camel@stevo-desktop> Message-ID: <20060208191146.GE1697@mellanox.co.il> I suggest this in rping_setup_buffers: while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr))); This way you will never have 0 end-to-end credits. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From swise at opengridcomputing.com Wed Feb 8 11:35:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 13:35:03 -0600 Subject: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA In-Reply-To: <20060208190414.GC1697@mellanox.co.il> References: <1139420490.26808.32.camel@stevo-desktop> <20060208190414.GC1697@mellanox.co.il> Message-ID: <1139427303.9121.3.camel@stevo-desktop> > > > > I just read this section in the 1.2 version of the spec, and I still > > don't understand what the issue really is? 9.7.7.2 talks about IBA > > doing flow control based on the RECV WQEs posted. rping always ensures > > that there is a RECV posted before the peer can send.
This is ensured > > by the rping protocol itself (see the comment at the front of rping.c > > describing the ping loop). > > > > I'm only ever sending one outstanding message via SEND/RECV. I would > > rather post exactly what is needed, than post some number of RECVs "just > > to be safe". Sorry if I'm being dense. What am I missing here? > > > > Steve. > > > > As far as I know, the credits are only updated by the ACK messages. > If there is a single work request outstanding on the RQ, > the ACK of the SEND message will have the credit field value 0 > (since exactly one receive WR was outstanding, and that is now consumed). > > As a result the remote side will "think" that there are no > receive WQEs and will slow down (what the spec refers to as limited WQE). Oh. I understand now. This is an issue with only 1 RQ WQE posted and how IB tries to inform the peer transport of the WQE count. For iWARP, none of this transport-level flow control happens (and I'm more familiar with iWARP than IB). From swise at opengridcomputing.com Wed Feb 8 11:37:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Feb 2006 13:37:52 -0600 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <20060208191146.GE1697@mellanox.co.il> References: <1139416042.26808.14.camel@stevo-desktop> <20060208191146.GE1697@mellanox.co.il> Message-ID: <1139427472.9121.6.camel@stevo-desktop> On Wed, 2006-02-08 at 21:11 +0200, Michael S. Tsirkin wrote: > I suggest this in rping_setup_buffers: > while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr))); > > This way you will never have 0 end-to-end credits. > I can do this easily, but it bothers me to post the same buffer multiple times, knowing the application doesn't need it (and would fail if more than one RECV is consumed at a time), just to make the transport more efficient. Is this common practice for IB applications? Thanks, Steve.
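To make the end-to-end credit argument from this thread concrete, here is a toy model in C. It is only a sketch under stated assumptions, not the InfiniBand wire protocol: it assumes that each ACK advertises the number of receive WRs still posted after the incoming SEND has consumed one, and that the application reposts exactly one receive WR per completion, as described in the messages above. The names toy_rq, toy_receive_send and zero_credit_acks are made up for this illustration and do not come from rping or libibverbs.

```c
#include <assert.h>

/* Toy receive queue: tracks only how many receive WRs are posted.
 * Hypothetical type for illustration, not a libibverbs structure. */
struct toy_rq {
    int posted;
};

/* Model one incoming SEND. Returns the credit count that the ACK
 * would carry under this model's assumption (WRs left after the
 * SEND consumed one). Afterwards the application reposts one WR. */
static int toy_receive_send(struct toy_rq *rq)
{
    int credits;

    rq->posted--;          /* the SEND consumes one receive WR      */
    credits = rq->posted;  /* the ACK advertises what is left       */
    rq->posted++;          /* the app reposts one WR on the CQE     */
    return credits;
}

/* Count how many ACKs carry zero credits when the receive queue
 * starts with initial_recvs posted WRs and messages SENDs arrive
 * one at a time. */
int zero_credit_acks(int initial_recvs, int messages)
{
    struct toy_rq rq = { initial_recvs };
    int zero = 0;
    int i;

    assert(initial_recvs >= 1);  /* a SEND needs a posted receive */
    for (i = 0; i < messages; i++) {
        if (toy_receive_send(&rq) == 0)
            zero++;
    }
    return zero;
}
```

In this model a queue that starts with a single posted receive produces a zero-credit ACK for every message, while any starting depth of two or more produces none. That matches the reasoning given for pre-posting several receives at initialization and reposting one per completion.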
From rdreier at cisco.com Wed Feb 8 11:39:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 11:39:39 -0800 Subject: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139427472.9121.6.camel@stevo-desktop> (Steve Wise's message of "Wed, 08 Feb 2006 13:37:52 -0600") References: <1139416042.26808.14.camel@stevo-desktop> <20060208191146.GE1697@mellanox.co.il> <1139427472.9121.6.camel@stevo-desktop> Message-ID: Steve> Is this common practice for IB applications? No, I think it's more of a cute trick that works in your particular case. From ardavis at ichips.intel.com Wed Feb 8 11:41:21 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 08 Feb 2006 11:41:21 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F122DA46@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F122DA46@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <43EA4961.6090905@ichips.intel.com> Caitlin Bestler wrote: >openib-general-bounces at openib.org wrote: > > >>I was under the assumption that the DAT` community defined the >>APIs and semantics through an open process. Given that the >>IB write immediate data facility does not break the >>implementation or semantics of the currently defined RDMA >>write facility, I see no reason the DAPL spec couldn't be >>updated, through consensus, with the realities of existing >>transport services. >> I agree, but I am not sure we are any closer to consensus regarding an **unambiguous** write with immediate. I am also not totally clear on the rules by which we implement new API's. Regardless, it appears that there is a strong desire to follow the IB semantics if we go with a standard API and if we cannot come to consensus then it should be incorporated as an extension. I would vote for implementing the standard rdma_write_with_ immediate API that follows IB semantics. 
-arlin > > > From rdreier at cisco.com Wed Feb 8 11:49:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 11:49:36 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208124544.GL28594@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 8 Feb 2006 14:45:44 +0200") References: <20060208124544.GL28594@mellanox.co.il> Message-ID: I think we might want to be even more paranoid and wait until the broadcast join succeeds before allowing send-only joins. Otherwise we could create a send-only MCG with the wrong Q_Key, SL, etc. something like this maybe? --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 5337) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -222,6 +222,13 @@ static int ipoib_mcast_join_finish(struc sizeof (union ib_gid))) { priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; + + /* + * Make sure that all the attributes are visible + * before we set the attached bit, so that send-only + * joins don't get started with incorrect attributes. 
+ */ + smp_wmb(); } if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { @@ -533,8 +540,10 @@ void ipoib_mcast_join_task(void *dev_ptr } if (!priv->broadcast) { - priv->broadcast = ipoib_mcast_alloc(dev, 1); - if (!priv->broadcast) { + struct ipoib_mcast *broadcast; + + broadcast = ipoib_mcast_alloc(dev, 1); + if (!broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) @@ -544,10 +553,11 @@ void ipoib_mcast_join_task(void *dev_ptr return; } - memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + spin_lock_irq(&priv->lock); + memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); + priv->broadcast = broadcast; - spin_lock_irq(&priv->lock); __ipoib_mcast_add(dev, priv->broadcast); spin_unlock_irq(&priv->lock); }
From mst at mellanox.co.il Wed Feb 8 11:58:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 21:58:39 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208195839.GB32759@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib_mcast_send.patch > > I think we might want to be even more paranoid and wait until the > broadcast join succeeds before allowing send-only joins. Otherwise > we could create a send-only MCG with the wrong Q_Key, SL, etc. > > something like this maybe?
Right, but I thought atomic test_and_set_bit implied smp_wmb already? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 11:59:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 21:59:43 +0200 Subject: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: References: Message-ID: <20060208195943.GC32759@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA > > Steve> Is this common practice for IB applications? > > No, I think it's more of a cute trick that works in your particular case. > Correct. Real apps are unlikely to get by with a single outstanding WR. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Wed Feb 8 12:02:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 12:02:49 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208195839.GB32759@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 8 Feb 2006 21:58:39 +0200") References: <20060208195839.GB32759@mellanox.co.il> Message-ID: Michael> Right, but I thought atomic test_and_set_bit implied Michael> smp_wmb already? So did I but then I looked in the kernel source and now I think that set_bit operations are only ordered against other bitops that touch the same word. For example ia64 just uses cmpxchg to implement the bitops, and powerpc just uses locked loads and stores. - R. From mst at mellanox.co.il Wed Feb 8 12:06:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 22:06:56 +0200 Subject: [openib-general] Re: Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208200656.GD32759@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: ipoib_mcast_send.patch > > Michael> Right, but I thought atomic test_and_set_bit implied > Michael> smp_wmb already? 
> > So did I but then I looked in the kernel source and now I think that > set_bit operations are only ordered against other bitops that touch > the same word. For example ia64 just uses cmpxchg to implement the > bitops, and powerpc just uses locked loads and stores. Ugh, if that's the case you can't protect arbitrary data with a bit: you need a spinlock or a barrier? Wouldn't lots of code in ipoib that looks at bits be broken then? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 12:14:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 22:14:04 +0200 Subject: [openib-general] IPoIB and lid change Message-ID: <20060208201404.GE32759@mellanox.co.il> Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a remote node path for a long time. The remote LID may change, e.g. if the SM is changed, and IPoIB might lose connectivity. One simple way to address this would be to have a list of all address handles per net device and kill them on an SM change event. What do you think? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 12:22:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Feb 2006 22:22:38 +0200 Subject: [openib-general] Re: Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208202238.GF32759@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: ipoib_mcast_send.patch > > Michael> Right, but I thought atomic test_and_set_bit implied > Michael> smp_wmb already? > > So did I but then I looked in the kernel source and now I think that > set_bit operations are only ordered against other bitops that touch > the same word. For example ia64 just uses cmpxchg to implement the > bitops, and powerpc just uses locked loads and stores. > > - R. > Hmm. Roland, which kernel version is that?
On 2.6.15 I see in include/asm-powerpc/bitops.h static __inline__ int test_and_set_bit(unsigned long nr, volatile unsigned long *addr) { unsigned long old, t; unsigned long mask = BITOP_MASK(nr); unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); __asm__ __volatile__( EIEIO_ON_SMP "1:" PPC_LLARX "%0,0,%3 # test_and_set_bit\n" "or %1,%0,%2 \n" PPC405_ERR77(0,%3) PPC_STLCX "%1,0,%3 \n" "bne- 1b" ISYNC_ON_SMP : "=&r" (old), "=&r" (t) : "r" (mask), "r" (p) : "cc", "memory"); return (old & mask) != 0; } EIEIO_ON_SMP is a write barrier on smp, isnt it? I see this in 2.6.11: include/asm-ppc64/bitops.h static __inline__ int test_and_set_bit(unsigned long nr, volatile unsigned long *addr) { unsigned long old, t; unsigned long mask = 1UL << (nr & 0x3f); unsigned long *p = ((unsigned long *)addr) + (nr >> 6); __asm__ __volatile__( EIEIO_ON_SMP "1: ldarx %0,0,%3 # test_and_set_bit\n\ or %1,%0,%2 \n\ stdcx. %1,0,%3 \n\ bne- 1b" ISYNC_ON_SMP : "=&r" (old), "=&r" (t) : "r" (mask), "r" (p) : "cc", "memory"); return (old & mask) != 0; } EIEIO_ON_SMP is exactly what is needed, no? /* * The test_and_*_bit operations are taken to imply a memory barrier * on SMP systems. */ ... /* * test_and_*_bit do imply a memory barrier (?) */ static __inline__ int test_and_set_bit(int nr, volatile unsigned long *addr) { unsigned int old, t; unsigned int mask = 1 << (nr & 0x1f); volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5); __asm__ __volatile__(SMP_WMB "\n\ 1: lwarx %0,0,%4 \n\ or %1,%0,%3 \n" PPC405_ERR77(0,%4) " stwcx. %1,0,%4 \n\ bne 1b" SMP_MB : "=&r" (old), "=&r" (t), "=m" (*p) : "r" (mask), "r" (p), "m" (*p) : "cc", "memory"); return (old & mask) != 0; } Ahem. It does look to me like atomics imply smp_wmb. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From Thomas.Talpey at netapp.com Wed Feb 8 12:58:56 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 08 Feb 2006 15:58:56 -0500 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 Message-ID: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> We have released an updated NFS/RDMA client for Linux at the project's Sourceforge site: This release updates the RPC/RDMA support as follows: Linux 2.6.15.2 supported Integrates with RPC via 2.6.15 transport switch Employs OpenIB RDMA verbs API (not kDAPL) Dual BSD/GPL2 licensing There are no protocol changes in this release, it is identical to the previous release (and the IETF draft) in this respect. The client has been tested with NFSv3 and passes the Connectathon test suite. At present, the client requires some additional transport switch patches to be applied to the Linux kernel, these are available at Chuck Lever's patches page: The related CITI NFS/RDMA server project is currently available for 2.6.14 from: This server is functional but only supports small RDMA inline data transfers, and a single request in flight. So, its performance is quite far from the potential. However, it is functional and is the server we pass Connectathon with! The server project is now being developed by Open Grid Computing, moving to the OpenIB common RDMA verbs API. We'll be making updates to both client and server as they become available. There's a lot more to do. We look forward to comments and feedback from the various standards and open source communities on this. Feel free to use the mailing list on the sourceforge project site, or any of these lists (which we usually monitor) but cc at least me and James Lentini (jlentini at netapp.com). Thanks, Tom Talpey, for the various NFS/RDMA projects. 
From krause at cup.hp.com Wed Feb 8 13:25:49 2006 From: krause at cup.hp.com (Michael Krause) Date: Wed, 08 Feb 2006 13:25:49 -0800 Subject: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA In-Reply-To: <20060208190414.GC1697@mellanox.co.il> References: <1139420490.26808.32.camel@stevo-desktop> <20060208190414.GC1697@mellanox.co.il> Message-ID: <6.2.0.14.2.20060208132434.026f1b68@esmail.cup.hp.com> At 11:04 AM 2/8/2006, Michael S. Tsirkin wrote: >Quoting r. Steve Wise : > > Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user > moderdmaping/pongprogram using CMA > > > > On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote: > > > Quoting r. Sean Hefty : > > > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode > rdmaping/pongprogram using CMA > > > > > > > > >Steve, looks like you have at most a single receive work request > posted at the > > > > >receive workqueue at all times. > > > > >If true, this is *really* not a good idea, performance-wise, even > if you > > > > >actually have at most 1 packet in flight. > > > > > > > > Can you provide some more details on this? > > > > > > See 9.7.7.2 end-to-end (message level) flow control > > > > > > > I just read this section in the 1.2 version of the spec, and I still > > don't understand what the issue really is? 9.7.7.2 talks about IBA > > doing flow control based on the RECV WQEs posted. rping always ensures > > that there is a RECV posted before the peer can send. This is ensured > > by the rping protocol itself (see the comment at the front of rping.c > > describing the ping loop). > > > > I'm only ever sending one outstanding message via SEND/RECV. I would > > rather post exactly what is needed, than post some number of RECVs "just > > to be safe". Sorry if I'm being dense. What am I missing here? > > > > Steve. > > > >As far as I know, the credits are only updated by the ACK messages. 
>If there is a single work request outstanding on the RQ, >the ACK of the SEND message will have the credit field value 0 >(since exactly one receive WR was outstanding, and that is now consumed). > >As a result the remote side will "think" that there are no >receive WQEs and will slow down (what spec refers to as limited WQE). Correct. The ACK / NAK protocol used by IB is used to return credits. To pipeline and improve performance, you must post multiple receive work requests in order to account for the expected round trip time of the fabric and the associated CA processing. Mike From krause at cup.hp.com Wed Feb 8 13:36:50 2006 From: krause at cup.hp.com (Michael Krause) Date: Wed, 08 Feb 2006 13:36:50 -0800 Subject: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA In-Reply-To: <1139427303.9121.3.camel@stevo-desktop> References: <1139420490.26808.32.camel@stevo-desktop> <20060208190414.GC1697@mellanox.co.il> <1139427303.9121.3.camel@stevo-desktop> Message-ID: <6.2.0.14.2.20060208133213.02430320@esmail.cup.hp.com> At 11:35 AM 2/8/2006, Steve Wise wrote: > > > > > > I just read this section in the 1.2 version of the spec, and I still > > > don't understand what the issue really is? 9.7.7.2 talks about IBA > > > doing flow control based on the RECV WQEs posted. rping always ensures > > > that there is a RECV posted before the peer can send. This is ensured > > > by the rping protocol itself (see the comment at the front of rping.c > > > describing the ping loop). > > > > > > I'm only ever sending one outstanding message via SEND/RECV. I would > > > rather post exactly what is needed, than post some number of RECVs "just > > > to be safe". Sorry if I'm being dense. What am I missing here? > > > > > > Steve. > > > > > > > As far as I know, the credits are only updated by the ACK messages.
> > If there is a single work request outstanding on the RQ, > > the ACK of the SEND message will have the credit field value 0 > > (since exactly one receive WR was outstanding, and that is now consumed). > > > > As a result the remote side will "think" that there are no > > receive WQEs and will slow down (what spec refers to as limited WQE). > >Oh. I understand now. This is an issue with only 1 RQ WQE posted and >how IB tries to inform the peer transport of the WQE count. For iWARP, >none of this transport-level flow control happens (and I'm more familiar >with iWARP than IB). For iWARP, we decided not to implement application receiver based flow control for two reasons: TCP provides transport-level flow control (IB does not provide the equivalent per se), and, upon examination, the majority of the ULPs already exchange and track the number of receive buffers allowed to be processed, so there is no need to replicate this in iWARP. There are some subtleties as well between a message-based transport and a byte stream such as TCP that go into the equation but these are not that important for most application writers to deal with. Mike From rdreier at cisco.com Wed Feb 8 13:53:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 13:53:32 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208202238.GF32759@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 8 Feb 2006 22:22:38 +0200") References: <20060208202238.GF32759@mellanox.co.il> Message-ID: Michael> Ahem. It does look to me like atomics imply smp_wmb. Yes, you're right. I misread the source. - R. From rdreier at cisco.com Wed Feb 8 13:54:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 13:54:18 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208202238.GF32759@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 8 Feb 2006 22:22:38 +0200") References: <20060208202238.GF32759@mellanox.co.il> Message-ID: So something like this should be good enough:

--- infiniband/ulp/ipoib/ipoib_multicast.c (revision 5337)
+++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy)
@@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
        }
 
        if (!priv->broadcast) {
-               priv->broadcast = ipoib_mcast_alloc(dev, 1);
-               if (!priv->broadcast) {
+               struct ipoib_mcast *broadcast;
+
+               broadcast = ipoib_mcast_alloc(dev, 1);
+               if (!broadcast) {
                        ipoib_warn(priv, "failed to allocate broadcast group\n");
                        mutex_lock(&mcast_mutex);
                        if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
                        return;
                }
 
-               memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+               spin_lock_irq(&priv->lock);
+               memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
                       sizeof (union ib_gid));
+               priv->broadcast = broadcast;
 
-               spin_lock_irq(&priv->lock);
                __ipoib_mcast_add(dev, priv->broadcast);
                spin_unlock_irq(&priv->lock);
        }
@@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device
         */
        spin_lock(&priv->lock);
 
-       if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
+       if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) ||
+           !priv->broadcast ||
+           !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
                ++priv->stats.tx_dropped;
                dev_kfree_skb_any(skb);
                goto unlock;

From krause at cup.hp.com Wed Feb 8 13:48:06 2006 From: krause at cup.hp.com (Michael Krause) Date: Wed, 08 Feb 2006 13:48:06 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: References: Message-ID: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> At 09:16 PM 2/6/2006, Sean Hefty wrote: > >The requirement is to provide an API that supports RDMA writes with > immediate > >data. A send that follows an RDMA write is not immediate data, and the API > >should not be constructed around trying to make it so.
> >To be clear, I believe that write with immediate should be part of the normal >APIs, rather than an extension, but should be designed around those >devices that >provide it natively. One thing to keep in mind is that the IBTA workgroup responsible for the transport wanted to eliminate immediate data support entirely but it was retained solely to enable VIA application migration (even though the application base was quite small). If that requirement could have been eliminated, then it would have been gone in a heart beat. Given a RDMA-WRITE followed by a SEND provides the same application semantics based on the use models, iWARP chose not to support immediate data. So, here we have a long discussion on attempting to perpetuate a concept that is not universal across transports and was deemed to have minimal value that most wanted to see removed from the architecture. One has to question the value of trying to develop any API / software to support immediate data instead of just enabling the preferred method which is RDMA WRITE - SEND. I agree with those who have contended that this is difficult to do in a general purpose fashion. When all of this is taken into account, it seems the only good engineering answer is to eliminate immediate data support by the software and focused on the method that works across all interconnects. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Feb 8 13:56:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 13:56:28 -0800 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <20060208201404.GE32759@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 8 Feb 2006 22:14:04 +0200") References: <20060208201404.GE32759@mellanox.co.il> Message-ID: Michael> One simple way to address this would be to have a list of Michael> all address handles per net device and kill them on an SM Michael> change event. Seems reasonable. 
It seems a little painful to implement at a first glance but I might be looking at it wrong. - R. From rdreier at cisco.com Wed Feb 8 14:03:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 14:03:45 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> (Michael Krause's message of "Wed, 08 Feb 2006 13:48:06 -0800") References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> Message-ID: Michael> So, here we have a long discussion on attempting to Michael> perpetuate a concept that is not universal across Michael> transports and was deemed to have minimal value that most Michael> wanted to see removed from the architecture. But this discussion is being driven by an application developer who does see value in immediate data. Arlin, can you quantify the benefit you see from RDMA write with immediate vs. RDMA write followed by a send? - R. From mst at mellanox.co.il Wed Feb 8 14:15:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 00:15:10 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208221510.GE32082@mellanox.co.il> Quoting r. 
Roland Dreier :
> Subject: Re: ipoib_mcast_send.patch
>
> So something like this should be good enough:
>
> --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 5337)
> +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy)
> @@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
>         }
>
>         if (!priv->broadcast) {
> -               priv->broadcast = ipoib_mcast_alloc(dev, 1);
> -               if (!priv->broadcast) {
> +               struct ipoib_mcast *broadcast;
> +
> +               broadcast = ipoib_mcast_alloc(dev, 1);
> +               if (!broadcast) {
>                         ipoib_warn(priv, "failed to allocate broadcast group\n");
>                         mutex_lock(&mcast_mutex);
>                         if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> @@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
>                         return;
>                 }
>
> -               memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
> +               spin_lock_irq(&priv->lock);
> +               memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
>                        sizeof (union ib_gid));
> +               priv->broadcast = broadcast;
>
> -               spin_lock_irq(&priv->lock);
>                 __ipoib_mcast_add(dev, priv->broadcast);
>                 spin_unlock_irq(&priv->lock);
>         }

Thats identical to what I posted till this point - right?

> @@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device
>          */
>         spin_lock(&priv->lock);
>
> -       if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
> +       if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) ||
> +           !priv->broadcast ||
> +           !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
>                 ++priv->stats.tx_dropped;
>                 dev_kfree_skb_any(skb);
>                 goto unlock;
>

I thought its important for performance to queue packets under mcast->pkt_queue? If not why do we do it? Maybe we shouldnt call netif_carrier_on if we drop all packets? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Wed Feb 8 14:22:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 14:22:11 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208221510.GE32082@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 9 Feb 2006 00:15:10 +0200") References: <20060208221510.GE32082@mellanox.co.il> Message-ID: Michael> Thats identical to what I posted till this point - right? I think I added one blank line, but other than that, yes. Michael> I thought its important for performance to queue packets Michael> under mcast->pkt_queue? If not why do we do it? Maybe we Michael> shouldnt call netif_carrier_on if we drop all packets? The queueing is there so that we aren't guaranteed to drop the first multicast packet sent to a given group. I'm not sure that it really is important, but it does seem like it would be bad to lose that packet every time. From reading the code we can't call netif_carrier_on until after priv->broadcast has the attached flag set. In ipoib_mcast_join_task(), we have

        if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
                ipoib_mcast_join(dev, priv->broadcast, 0);
                return;
        }

and then at the very bottom

        netif_carrier_on(dev);

- R. From mst at mellanox.co.il Wed Feb 8 14:27:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 00:27:38 +0200 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: References: Message-ID: <20060208222738.GF32082@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: IPoIB and lid change > > Michael> One simple way to address this would be to have a list of > Michael> all address handles per net device and kill them on an SM > Michael> change event. > > Seems reasonable. It seems a little painful to implement at a first > glance but I might be looking at it wrong. It will be very easy once you merge ipoib_all_neigh_issues_2.patch since that gets us a list of neigh to walk on sm event. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 8 14:28:53 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 9 Feb 2006 00:28:53 +0200 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: References: Message-ID: <20060208222853.GG32082@mellanox.co.il> Quoting r. Roland Dreier : > >From reading the code we can't call netif_carrier_on until after > priv->broadcast has the attached flag set. In ipoib_mcast_join_task(), > we have > > if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > ipoib_mcast_join(dev, priv->broadcast, 0); > return; > } > > and then at the very bottom > > netif_carrier_on(dev); > > - R. > But we set the bit on finish, I thought ipoib_mcast_join only starts it - right? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Wed Feb 8 14:29:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 14:29:49 -0800 Subject: [openib-general] Re: ipoib_mcast_send.patch In-Reply-To: <20060208222853.GG32082@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 9 Feb 2006 00:28:53 +0200") References: <20060208222853.GG32082@mellanox.co.il> Message-ID: Michael> But we set the bit on finish, I thought ipoib_mcast_join Michael> only starts it - right? Right, but ipoib_mcast_join_task() is guaranteed to return before it reaches the netif_carrier_on() unless priv->broadcast is attached. - R. From ardavis at ichips.intel.com Wed Feb 8 15:36:37 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 08 Feb 2006 15:36:37 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> Message-ID: <43EA8085.5070208@ichips.intel.com> Roland Dreier wrote: > Michael> So, here we have a long discussion on attempting to > Michael> perpetuate a concept that is not universal across > Michael> transports and was deemed to have minimal value that most > Michael> wanted to see removed from the architecture. 
> >But this discussion is being driven by an application developer who >does see value in immediate data. > >Arlin, can you quantify the benefit you see from RDMA write with >immediate vs. RDMA write followed by a send? > > > We need speed and simplicity: a very latency-sensitive application that requires immediate notification of RDMA write completion on the remote node without ANY latency penalties associated with combining operations, HCA priority rules across QPs, wire congestion, etc., and that has no requirement for messaging outside of remote RDMA write completion notifications. The application would not have to register and manage additional message buffers on either side; we can just size the queues accordingly and post zero-byte messages. We need something equivalent to sitting there polling on the last byte of inbound data, but since data ordering within an operation is not guaranteed, that is not an option. So RDMA with immediate data is the simplest and most efficient method for indicating RDMA-write completion that we have available today. In fact, I would like to see it increased in size to make it even more useful. -arlin From rangam at gmail.com Wed Feb 8 16:24:33 2006 From: rangam at gmail.com (Srirangam Addepalli) Date: Wed, 8 Feb 2006 18:24:33 -0600 Subject: [openib-general] Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID) Message-ID: <8c02c3450602081624p73890577tb47d28e63c0353f0@mail.gmail.com> Hello All, When I run vstat I get the following error. What does this mean? vstat 1 HCA found: hca_id=InfiniHost0 Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID) /var/log/messages has this: [KERNEL_IB][_tslbTavorPnPEventHandler][/var/tmp/IBGD//tmp/openib/infiniband/ib_verbs/hw/provider/tavor_main.c: 352]_tslbTavorPnPEventHandler: could not add HCA InfiniHost0 (-19) What are the possible things that might have gone wrong? Does anyone know?
Rangam From rdreier at cisco.com Wed Feb 8 16:35:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Feb 2006 16:35:12 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: <43EA8085.5070208@ichips.intel.com> (Arlin Davis's message of "Wed, 08 Feb 2006 15:36:37 -0800") References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> <43EA8085.5070208@ichips.intel.com> Message-ID: Arlin> A very latency sensitive application that requires Arlin> immediate notification of RDMA write completion on the Arlin> remote node without ANY latency penalties associated with Arlin> combining operations, HCA priority rules across QPs, wire Arlin> congestion, etc. An application that has no requirement for Arlin> messaging outside of remote rdma write completion Arlin> notifications. The application would not have to register Arlin> and manage additional message buffers on either side, we Arlin> can just size the queues accordingly and post zero byte Arlin> messages. We need something that would be equivalent to Arlin> sitting there polling on the last byte of inbound Arlin> data. But, since data ordering within an operation is not Arlin> guaranteed that is not an option. So, rdma with immediate Arlin> data is the most optimal and simplistic method for Arlin> indication of RDMA-write completion that we have available Arlin> today. In fact, I would like to see it increased in size to Arlin> make it even more useful. Hmm. Can you put a number on how much better RDMA write with immediate is on current HCA hardware? How does using the underlying OpenIB verbs ability to post a list of work requests compare (ie posting an RDMA write followed by a send in one verbs call)? Maybe "post multiple" is a better direction for DAT. - R.
From roy.k.larsen at intel.com Wed Feb 8 16:57:09 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Wed, 8 Feb 2006 16:57:09 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E086B7BEF@orsmsx408> One thing to keep in mind is that the IBTA workgroup responsible for the transport wanted to eliminate immediate data support entirely but it was retained solely to enable VIA application migration (even though the application base was quite small). If that requirement could have been eliminated, then it would have been gone in a heart beat. Given a RDMA-WRITE followed by a SEND provides the same application semantics based on the use models, iWARP chose not to support immediate data. Mike, I was not part of the original IBTA discussions and I won't argue whether this facility should or shouldn't have been include. Nevertheless, it is part of the specification, there are HCA vendors that implement it, and we have applications that make use of it. I would, however, disagree with your assertion that write followed by a send is semantically equivalent to write immediate. Ordering may be semantically the same, but the service is not. Receive work completions are explicitly indicated as being associated with immediate data and therefore an associated write completion. A write followed by a send does not provide the same indication semantic. Roy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From caitlinb at broadcom.com Wed Feb 8 22:07:41 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 8 Feb 2006 22:07:41 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DC59@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Arlin> A very latency sensitive application that requires > Arlin> immediate notification of RDMA write completion on the > Arlin> remote node without ANY latency penalties associated with > Arlin> combining operations, HCA priority rules across QPs, wire > Arlin> congestion, etc. An application that has no requirement for > Arlin> messaging outside of remote rdma write completion > Arlin> notifications. The application would not have to register > Arlin> and manage additional message buffers on either side, we > Arlin> can just size the queues accordingly and post zero byte > Arlin> messages. We need something that would be equivelent to > Arlin> setting there polling on the last byte of inbound > Arlin> data. But, since data ordering within an operation is not > Arlin> guaranteed that is not an option. So, rdma with immediate > Arlin> data is the most optimal and simplistic method for > Arlin> indication of RDMA-write completion that we have available > Arlin> today. In fact, I would like to see it increased in size to > Arlin> make it even more useful. > > Hmm. Can you put a number on how much better RDMA write with > immediate is on current HCA hardware? How does using the > underlying OpenIB verbs ability to post a list of work > requests compare (ie posting an RDMA write followed by a send > in one verbs call)? > Maybe "post multiple" is a better direction for DAT. > The distinction between "Write and Send" versus "post multiple" is that it maintains a very simple one-to-one correspondence with the post_recv at the data sink. 
I also do not see how the *application* keeping the "write and send" semantics can have a negative performance implication if we allow InfiniBand Providers to encode it as an RDMA Write with Immediate. If the Data Source needs to communicate to the Data Sink that a specific RDMA Write transfer is done then it is sending a message. Information transfer and synchronization is occuring. I fail to see the value, let alone the optimization, of layering on an extra bit of information disguised as an opcode and using a specific transport's encoding methods as the model for a transport neutral API (particularly one at the DAT layer, at the verb layer it is a different issue because at the verb layer we do not want to hide any hardware capabilities even while encouraging safe harbor transport neutral practices). If distinquishing between 32-bit messages and 32-bit immediates that can arrive in indeterminate order is really that important to your application then maybe you really needed a 33-bit message to begin with. Encoding application layer information via your choice of carrier pigeon is not a very robust strategy.
From caitlinb at broadcom.com Wed Feb 8 22:13:03 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 8 Feb 2006 22:13:03 -0800 Subject: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DC5B@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > At 11:35 AM 2/8/2006, Steve Wise wrote: > > > > > > > > > I just read this section in the 1.2 version of the > spec, and I still > > > don't understand what the issue really is? 9.7.7.2 > talks about IBA > > > doing flow control based on the RECV WQEs posted. > rping always ensures > > > that there is a RECV posted before the peer can > send. This is ensured > > > by the rping protocol itself (see the comment at > the front of rping.c > > > describing the ping loop). > > > > > > I'm only ever sending one outstanding message via SEND/RECV. I > would > > rather post exactly what is needed, than post some > number of RECVs "just > > > to be safe". Sorry if I'm being dense. What am I > missing here? > > > > > > Steve. > > > > > > > As far as I know, the credits are only updated by the > ACK messages. > > If there is a single work request outstanding on the RQ, > > the ACK of the SEND message will have the credit field value 0 > > (since exactly one receive WR was outstanding, and > that is now consumed). > > > > As a result the remote side withh "think" that there are no > > receive WQEs and will slow down (what spec refers to > as limited WQE). > > Oh. I understand now. This is an issue with only 1 RQ > WQE posted and > how IB tries to inform the peer transport of the WQE > count. For iWARP, > none of this transport-level flow control happens (and > I'm more familiar > with iWARP than IB).
> > > For iWARP, we decided to not implement application receiver > based flow control due to two items:TCP provides > transport-level flow control (IB does not provide the > equivalent per se) and upon examination of the majority of > the ULP, they exchange and track the number of receive > buffers allowed to be processed thus there is no need to > replicate this in iWARP. There are some subtleties as well > between a message-based transport and a byte stream such as > TCP that go into the equation but these are not that > important for most application writers to deal with. > > Mike But in terms of compiling the safe harbor transport neutral recommended programming practices, I think this is a valid point. Having one "spare" buffer is a good safety mechanism at the application layer in general, *and* it may prevent snarls in the transport layer flow control. Suggesting that consumers avoid letting the RQ hit empty strikes me as a valid transport neutral recommendation. And we'll improve the public education by following those recommendations in sample and test programs. From sean.hefty at intel.com Wed Feb 8 23:20:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 23:20:00 -0800 Subject: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal In-Reply-To: Message-ID: >Hmm. Can you put a number on how much better RDMA write with >immediate is on current HCA hardware? How does using the underlying >OpenIB verbs ability to post a list of work requests compare (ie >posting an RDMA write followed by a send in one verbs call)? >Maybe "post multiple" is a better direction for DAT. A "post multiple" call as a general API makes sense, but I think that's a separate issue. Given that IB provides true immediate data with RDMA writes, a way should be available to make use of it.
I don't know what the performance numbers between using a write with immediate versus a write followed by a send, but I don't think that anyone could argue that the write with immediate wouldn't perform better. To me, the question is whether write with immediate is supported as a transport specific extension, which was Arlin's original patch, or through some standard API. The attempt to make the API standard, so that iWarp could emulate it (poorly in my view), is what appears to be driving the disagreements. It also appears to me that the decisions are coming down to one of the following. If iWarp can emulate write with immediate, then a generic API should be used. If iWarp cannot properly emulate write with immediate, then the API should be transport specific. It's curious to me that in both cases, iWarp is driving the API decision and design for something that is an IB specific feature. - Sean From jackm at mellanox.co.il Wed Feb 8 23:23:34 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 9 Feb 2006 09:23:34 +0200 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CCEBB@mtlexch01.mtl.com> My point was not the total storage used for the array (it ends up more than the linked list, as you noted). I'm concerned that an allocation of a 4K buffer may fail in a situation where lots of small allocations of around 256 bytes would succeed. Is your point that if we fail to allocate a 4K buffer, we're in deep trouble already? Note that I've only considered a 1000 host cluster. What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer) -- the linked list has no scalability problem (no need to push RMPP handling to user space). Regarding the list-walk, if we track the "last-sent segment" in the list, there is no need to do the list walk (we simply get the next segment in the list). 
We'll only have a short list walk when the "ack" pointer gets updated (need to walk forward only items in the linked list from the previously ack'ed item). -- What is the reason you are thinking about 64-byte boundary support? Jack -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Wednesday, February 08, 2006 7:13 PM To: Jack Morgenstein; openib-general at openib.org Subject: RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support >For example, a 1000 host cluster, with 2 ports per HCA will have at >least 4000 records in a SubnAdmGetTableResp for all PortInfo records on >the network (2000 for HCAs, and at least 2000 for the switch ports). >Such a query response will generate an RMPP of size 256K -- 1000 >segments, or a 4K buffer on an X86 machine just for the array (assuming >one allocation per RMPP segment -- N=1). I think that this is a good reason to use an array. Walking a 1000 entry list 1000 times is a substantial performance hit. Lost MADs and retries will make this worse. A 4K buffer for the array is less than the 8K total needed for the 1000 list items. We're already talking about allocating over 256K of memory just for the data payload. An additional contiguous 4k buffer seems like a minor issue. I'm not convinced that there's a real issue here. To support ridiculously large transfers from userspace, we may need to push the RMPP handling up into userspace. - Sean From sean.hefty at intel.com Wed Feb 8 23:45:22 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 8 Feb 2006 23:45:22 -0800 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CCEBB@mtlexch01.mtl.com> Message-ID: >I'm concerned that an allocation of a 4K buffer may fail in a situation >where lots of small allocations of around 256 bytes would succeed. Is >your point that if we fail to allocate a 4K buffer, we're in deep >trouble already? 
Note that I've only considered a 1000 host cluster. Yes - if we can't allocate a 4k buffer, it seems highly unlikely that we'd be able to allocate 1000 256-byte buffers. >What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer) >-- the linked list has no scalability problem (no need to push RMPP >handling to user space). I did consider this, and I don't know when we'll start hitting issues allocating a single data buffer. But we're going to ask for 10,000 256-byte buffers - over 2.5 MB of kernel memory in order to perform this single data transfer. Is it likely that we can allocate that much memory, but not the 40k buffer? I really don't know. If the answer is yes, then I agree that using a linked list would be better. >Regarding the list-walk, if we track the "last-sent segment" in the >list, there is no need to do the list walk (we simply get the next >segment in the list). We'll only have a short list walk when the "ack" >pointer gets updated (need to walk forward only > items in the linked list from the >previously ack'ed item). I thought of this as well. For efficiency, you need to track the last sent and last acked, meaning that the list will be walked at most twice. You may be able to jump the ack pointer to last sent if that is a common case. >What is the reason you are thinking about 64-byte boundary support? I was concerned about 64-byte values in the MADs aligned on a 32-byte boundary. But then I think that some of the MADs have this issue anyway by architectural design. 
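The cursor tracking discussed above (remember the last sent and the last acked segment so the list is never re-walked from the head) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ib_mad code: the types rmpp_seg and rmpp_send and the helpers next_to_send and advance_ack are hypothetical names invented for this sketch.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; not the ib_mad structures. */
struct rmpp_seg {
    struct rmpp_seg *next;
    int seg_num;                 /* 1-based segment number */
};

struct rmpp_send {
    struct rmpp_seg *head;       /* first segment of the message */
    struct rmpp_seg *last_sent;  /* NULL until a segment has been sent */
    struct rmpp_seg *last_acked; /* NULL until a segment has been acked */
};

/* Return the next segment to transmit in O(1), or NULL when all are sent. */
static struct rmpp_seg *next_to_send(struct rmpp_send *s)
{
    struct rmpp_seg *seg = s->last_sent ? s->last_sent->next : s->head;

    if (seg)
        s->last_sent = seg;
    return seg;
}

/*
 * Move the ack cursor forward to segment acked_num. Only the newly acked
 * segments are visited. Returns how many segments were stepped over.
 */
static int advance_ack(struct rmpp_send *s, int acked_num)
{
    struct rmpp_seg *seg = s->last_acked ? s->last_acked->next : s->head;
    int steps = 0;

    while (seg && seg->seg_num <= acked_num) {
        s->last_acked = seg;
        seg = seg->next;
        steps++;
    }
    return steps;
}
```

With both cursors, each segment is visited once for sending and once for acking over the life of the transfer, which matches the "walked at most twice" estimate. Retransmission after a timeout is not covered here; one possible approach would be to reset last_sent to last_acked.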
- Sean From mst at mellanox.co.il Thu Feb 9 00:04:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 10:04:51 +0200 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support In-Reply-To: References: Message-ID: <20060209080451.GE28594@mellanox.co.il> > >The hardware fetches two SGEs anyway, since the QP was created with the > >number of send s/g entries = 2.
> > It fetches both of the SGEs as part of the WR, but only requires a > single fetch for the data, versus two. Sean, I wouldn't worry about that. Hardware can (and does) pipeline these two reads, so the difference in speed is not measurable. Try testing it on top of ibverbs and see. It seems obvious to me that any reasonable implementation will work this way. -- What item should I pick to always win in rock, scissors, paper? From mst at mellanox.co.il Thu Feb 9 00:11:45 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 10:11:45 +0200 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: References: Message-ID: <20060209081145.GF28594@mellanox.co.il> Quoting r. Sean Hefty : > I did consider this, and I don't know when we'll start hitting issues allocating > a single data buffer. But we're going to ask for 10,000 256-byte buffers - over > 2.5 MB of kernel memory in order to perform this single data transfer. Is it > likely that we can allocate that much memory, but not the 40k buffer? I really > don't know. If the answer is yes, then I agree that using a linked list would > be better. It seems to me a linked list is a better short term solution. Let's consider a NodeInfo record. It seems a MAD segment would include only 4 of these. So on a modest 4K node cluster, getting a list of all of them requires 1K segments. Keeping these in an array would need 8K bytes on a 64 bit server. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Thu Feb 9 04:10:47 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 09 Feb 2006 14:10:47 +0200 Subject: [openib-general] [PATCH] Opensm - client reregistration Message-ID: <5z4q38lol4.fsf@mtl066.yok.mtl.com> Hi Hal, Currently, OpenSM sends PortInfo with the ClientReRegistration bit turned on only during the first sweep after becoming Master. This doesn't cover all cases where ClientReRegistration should be turned on.
OpenSM should also turn on this bit for new ports it discovers (in cases of subnet merging, for example). The following patch adds support for turning on the ClientReRegistration bit on newly discovered ports. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 5337) +++ opensm/osm_lid_mgr.c (working copy) @@ -897,6 +897,7 @@ __osm_lid_mgr_get_port_lid( static boolean_t __osm_lid_mgr_set_physp_pi( IN osm_lid_mgr_t * const p_mgr, + IN osm_port_t* const p_port, IN osm_physp_t* const p_physp, IN ib_net16_t const lid ) { @@ -910,6 +911,7 @@ __osm_lid_mgr_set_physp_pi( uint8_t op_vls; uint8_t port_num; boolean_t send_set = FALSE; + boolean_t new_port = FALSE; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_set_physp_pi ); @@ -1140,8 +1142,17 @@ __osm_lid_mgr_set_physp_pi( /* We need to set the cli_rereg bit when we are in first_time_master_sweep for ports supporting the ClientReregistration Vol1 (v1.2) p811 14.4.11 + Also, if this port was just now discovered - then we should also set the + cli_rereg bit. We know that the port was just discovered if it is in + the p_subn->new_ports_list list.
*/ - if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE ) && + if ( cl_is_object_in_list(&p_mgr->p_subn->new_ports_list, p_port) ) { + /* p_port is in new_ports_list - mark new_port as TRUE */ + new_port = TRUE; + } + + if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE || + new_port == TRUE ) && ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) ) ib_port_info_set_client_rereg( p_pi, 1 ); else @@ -1243,7 +1254,7 @@ __osm_lid_mgr_process_our_sm_node( */ p_physp = osm_port_get_default_phys_ptr( p_port ); - __osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ) ); + __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) ); Exit: OSM_LOG_EXIT( p_mgr->p_log ); @@ -1367,7 +1378,7 @@ osm_lid_mgr_process_subnet( p_physp = osm_port_get_default_phys_ptr( p_port ); /* the proc returns the fact it sent a set port info */ - if (__osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ))) + if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; } } /* all ports */ From mst at mellanox.co.il Thu Feb 9 05:21:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 15:21:51 +0200 Subject: [openib-general] test (Was Re: FW: [PATCH 1 of 3] mad: large RMPP support) In-Reply-To: References: Message-ID: <20060209132151.GB30541@mellanox.co.il> test, please ignore -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sashak at voltaire.com Thu Feb 9 05:31:27 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Feb 2006 15:31:27 +0200 Subject: [openib-general] [PATCH] management/*/autogen.sh: config dir test and creation Message-ID: <20060209133127.GA29512@sashak.voltaire.com> Hi, This will check config dir existence and create it if necessary. Same as the previous one, but for libraries and diags. Sasha. This will check config dir existence and create it if necessary.
Signed-off-by: Sasha Khapyorsky --- diags/autogen.sh | 3 +++ libibcommon/autogen.sh | 3 +++ libibmad/autogen.sh | 3 +++ libibumad/autogen.sh | 3 +++ 4 files changed, 12 insertions(+), 0 deletions(-) 8886c81fefd19a1ae8103545e880b5680b4de7d1 diff --git a/diags/autogen.sh b/diags/autogen.sh index fd47839..4827884 100755 --- a/diags/autogen.sh +++ b/diags/autogen.sh @@ -1,5 +1,8 @@ #! /bin/sh +# create config dir if not exist +test -d config || mkdir config + set -x aclocal -I config libtoolize --force --copy diff --git a/libibcommon/autogen.sh b/libibcommon/autogen.sh index fd47839..4827884 100755 --- a/libibcommon/autogen.sh +++ b/libibcommon/autogen.sh @@ -1,5 +1,8 @@ #! /bin/sh +# create config dir if not exist +test -d config || mkdir config + set -x aclocal -I config libtoolize --force --copy diff --git a/libibmad/autogen.sh b/libibmad/autogen.sh index fd47839..4827884 100755 --- a/libibmad/autogen.sh +++ b/libibmad/autogen.sh @@ -1,5 +1,8 @@ #! /bin/sh +# create config dir if not exist +test -d config || mkdir config + set -x aclocal -I config libtoolize --force --copy diff --git a/libibumad/autogen.sh b/libibumad/autogen.sh index fd47839..4827884 100755 --- a/libibumad/autogen.sh +++ b/libibumad/autogen.sh @@ -1,5 +1,8 @@ #! /bin/sh +# create config dir if not exist +test -d config || mkdir config + set -x aclocal -I config libtoolize --force --copy -- 1.1.6 From ogerlitz at voltaire.com Thu Feb 9 05:51:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 9 Feb 2006 15:51:08 +0200 (IST) Subject: [openib-general] (no subject) Message-ID: Sean, Can you see if/what is there a way for a CMA consumer to set the QP timeout? 
Reading the .h files and the ib_cm/cma code the best i managed to find is the following setting in cm_init_qp_rts_attr qp_attr->timeout = cm_id_priv->local_ack_timeout; is cm_id_priv->local_ack_timeout related to req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; in cma_connect_ib ? thanks, Or. From mdidomenico at gmail.com Thu Feb 9 05:53:31 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Thu, 9 Feb 2006 08:53:31 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> Message-ID: <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> On 2/8/06, Michael Di Domenico wrote: > FYI... > Changed over to RHEL4 IA32 w/ SilverStorm Host Stack v3.2.0.0.21 and > now i get the below info and a working infiniband setup... > > Since I have two servers, I'm going to leave this one working and try > openib on the second machine... > > # uname -a > Linux linux14.silverstorm.com 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 > EST 2005 i686 i686 i386 GNU/Linux > > # lspci -vvv > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) > (prog-if 00 [Normal decode]) > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- > ParErr- Stepping- SERR+ FastB2B- > Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- SERR- Latency: 64, Cache Line Size 10 > Bus: primary=06, secondary=07, subordinate=07, sec-latency=64 > I/O behind bridge: 0000f000-00000fff > Memory behind bridge: fe500000-fe7fffff > Prefetchable memory behind bridge: 00000000eac00000-00000000fbc00000 > Secondary status: 66Mhz+ FastB2B- ParErr- DEVSEL=medium > >TAbort- BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B- > Capabilities: [70] PCI-X bridge device. 
> Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=3 > Status: Bus=6 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD- > : Upstream: Capacity=512, Commitment Limit=512 > : Downstream: Capacity=128, Commitment Limit=128 > > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) > Subsystem: Mellanox Technologies MT23108 InfiniHost > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- > ParErr- Stepping- SERR+ FastB2B- > Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium > >TAbort- SERR- Latency: 64, Cache Line Size 10 > Interrupt: pin A routed to IRQ 217 > Region 0: Memory at fe700000 (64-bit, non-prefetchable) [size=1M] > Region 2: Memory at fb000000 (64-bit, prefetchable) [size=8M] > Region 4: Memory at f0000000 (64-bit, prefetchable) [size=128M] > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > Vector table: BAR=0 offset=00082000 > PBA: BAR=0 offset=00082200 > Capabilities: [50] Vital Product Data > Capabilities: [60] Message Signalled Interrupts: 64bit+ > Queue=0/5 Enable- > Address: 0000000000000000 Data: 0000 > Capabilities: [70] PCI-X non-bridge device. > Command: DPERE- ERO- RBC=3 OST=1 > Status: Bus=7 Dev=0 Func=0 64bit+ 133MHz+ SCD- USC-, > DC=simple, DMMRBC=3, DMOST=1, DMCRS=0, RSCEM- My problems if you recall with the mellanox cards seem to be related to Linux Kernel 2.6.15.3... Installed RHEL4 instead of Fedora Core and get the same results using kernel 2.6.15.3 and RHEL4... Is there a list of supported kernels and Mellanox 23108 cards anywhere, thats compatible with OpenIB? 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) (prog-if ff) !!! Unknown header type 7f 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) (prog-if ff) !!! 
Unknown header type 7f From dotanb at mellanox.co.il Thu Feb 9 07:00:48 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 9 Feb 2006 17:00:48 +0200 Subject: [openib-general] core kernel changes for query SRQ Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD0F9@mtlexch01.mtl.com> Add support to uverbs to handle querying userspace SRQs (Shared Receive Queue), including adding an ABI for marshalling requests and responses. The kernel midlayer already has ib_query_srq(). Signed-off-by: Dotan Barak Index: last_stable/drivers/infiniband/core/uverbs.h =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs.h 2006-02-07 17:04:05.000000000 +0200 +++ last_stable/drivers/infiniband/core/uverbs.h 2006-02-08 09:51:59.000000000 +0200 @@ -201,6 +201,7 @@ IB_UVERBS_DECLARE_CMD(attach_mcast); IB_UVERBS_DECLARE_CMD(detach_mcast); IB_UVERBS_DECLARE_CMD(create_srq); IB_UVERBS_DECLARE_CMD(modify_srq); +IB_UVERBS_DECLARE_CMD(query_srq); IB_UVERBS_DECLARE_CMD(destroy_srq); #endif /* UVERBS_H */ Index: last_stable/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs_cmd.c 2006-02-07 17:04:05.000000000 +0200 +++ last_stable/drivers/infiniband/core/uverbs_cmd.c 2006-02-08 09:56:04.000000000 +0200 @@ -2,6 +2,7 @@ * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. + * Copyright (c) 2006 Mellanox Technologies. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -1823,6 +1824,49 @@ out: return ret ? 
ret : in_len; } +ssize_t ib_uverbs_query_srq(struct ib_uverbs_file *file, + const char __user *buf, + int in_len, int out_len) +{ + struct ib_uverbs_query_srq cmd; + struct ib_uverbs_query_srq_resp resp; + struct ib_srq_attr attr; + struct ib_srq *srq; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + mutex_lock(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out; + } + + ret = ib_query_srq(srq, &attr); + +out: + mutex_unlock(&ib_uverbs_idr_mutex); + if (!ret) { + memset(&resp, 0, sizeof resp); + + resp.max_wr = attr.max_wr; + resp.max_sge = attr.max_sge; + resp.srq_limit = attr.srq_limit; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + } + + return ret ? ret : in_len; +} + ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) Index: last_stable/drivers/infiniband/core/uverbs_main.c =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs_main.c 2006-02-07 21:13:38.000000000 +0200 +++ last_stable/drivers/infiniband/core/uverbs_main.c 2006-02-08 09:56:39.000000000 +0200 @@ -107,6 +107,7 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast, [IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq, [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, + [IB_USER_VERBS_CMD_QUERY_SRQ] = ib_uverbs_query_srq, [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, }; Index: last_stable/drivers/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- last_stable.orig/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-07 17:04:01.000000000 +0200 +++ 
last_stable/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-08 09:58:01.000000000 +0200 @@ -2,6 +2,7 @@ * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. * Copyright (c) 2005 PathScale, Inc. All rights reserved. + * Copyright (c) 2006 Mellanox Technologies. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -610,6 +611,20 @@ struct ib_uverbs_modify_srq { __u64 driver_data[0]; }; +struct ib_uverbs_query_srq { + __u64 response; + __u32 srq_handle; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ib_uverbs_query_srq_resp { + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; +}; + struct ib_uverbs_destroy_srq { __u64 response; __u32 srq_handle; Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] From dotanb at mellanox.co.il Thu Feb 9 07:05:07 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 9 Feb 2006 17:05:07 +0200 Subject: [openib-general] libibverbs + libmthca changes for query SRQ Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD0FE@mtlexch01.mtl.com> libibverbs and libmthca changes to handle querying SRQs. Essentially just adding an API and support for passing the call through to provider plug-ins.
Signed-off-by: Dotan Barak Index: last_stable/src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/driver.h 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/include/infiniband/driver.h 2006-02-08 09:38:24.000000000 +0200 @@ -107,6 +108,9 @@ int ibv_cmd_modify_srq(struct ibv_srq *s struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask, struct ibv_modify_srq *cmd, size_t cmd_size); +int ibv_cmd_query_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + struct ibv_query_srq *cmd, size_t cmd_size); int ibv_cmd_destroy_srq(struct ibv_srq *srq); int ibv_cmd_create_qp(struct ibv_pd *pd, Index: last_stable/src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-08 09:25:29.000000000 +0200 @@ -676,6 +677,23 @@ struct ibv_modify_srq { __u64 driver_data[0]; }; +struct ibv_query_srq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 srq_handle; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ibv_query_srq_resp { + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; +}; + struct ibv_destroy_srq { __u32 command; __u16 in_words; Index: last_stable/src/userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/verbs.h 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/include/infiniband/verbs.h 2006-02-08 09:27:38.000000000 +0200 @@ -556,6 +557,8 @@ struct ibv_context_ops { int (*modify_srq)(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum 
ibv_srq_attr_mask srq_attr_mask); + int (*query_srq)(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr); int (*destroy_srq)(struct ibv_srq *srq); int (*post_srq_recv)(struct ibv_srq *srq, struct ibv_recv_wr *recv_wr, @@ -815,6 +818,15 @@ int ibv_modify_srq(struct ibv_srq *srq, enum ibv_srq_attr_mask srq_attr_mask); /** + * ibv_query_srq - Returns the attribute list and current values for the + * specified SRQ. + * @srq: The SRQ to query. + * @srq_attr: The attributes of the specified SRQ. + */ +int ibv_query_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr); + +/** * ibv_destroy_srq - Destroys the specified SRQ. * @srq: The SRQ to destroy. */ Index: last_stable/src/userspace/libibverbs/src/cmd.c =================================================================== --- last_stable.orig/src/userspace/libibverbs/src/cmd.c 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/src/cmd.c 2006-02-08 09:29:41.000000000 +0200 @@ -488,6 +489,25 @@ int ibv_cmd_modify_srq(struct ibv_srq *s return 0; } +int ibv_cmd_query_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + struct ibv_query_srq *cmd, size_t cmd_size) +{ + struct ibv_query_srq_resp resp; + + IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_SRQ, &resp, sizeof resp); + cmd->srq_handle = srq->handle; + + if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + srq_attr->max_wr = resp.max_wr; + srq_attr->max_sge = resp.max_sge; + srq_attr->srq_limit = resp.srq_limit; + + return 0; +} + static int ibv_cmd_destroy_srq_v1(struct ibv_srq *srq) { struct ibv_destroy_srq_v1 cmd; Index: last_stable/src/userspace/libibverbs/src/libibverbs.map =================================================================== --- last_stable.orig/src/userspace/libibverbs/src/libibverbs.map 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/src/libibverbs.map 2006-02-08 09:30:18.000000000 +0200 @@ -25,6 +25,7 @@ IBVERBS_1.0 { ibv_ack_cq_events; ibv_create_srq; 
ibv_modify_srq; + ibv_query_srq; ibv_destroy_srq; ibv_create_qp; ibv_modify_qp; @@ -49,6 +50,7 @@ IBVERBS_1.0 { ibv_cmd_destroy_cq; ibv_cmd_create_srq; ibv_cmd_modify_srq; + ibv_cmd_query_srq; ibv_cmd_destroy_srq; ibv_cmd_create_qp; ibv_cmd_modify_qp; Index: last_stable/src/userspace/libibverbs/src/verbs.c =================================================================== --- last_stable.orig/src/userspace/libibverbs/src/verbs.c 2006-02-07 17:00:24.000000000 +0200 +++ last_stable/src/userspace/libibverbs/src/verbs.c 2006-02-08 09:31:25.000000000 +0200 @@ -280,6 +281,12 @@ int ibv_modify_srq(struct ibv_srq *srq, return srq->context->ops.modify_srq(srq, srq_attr, srq_attr_mask); } +int ibv_query_srq(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr) +{ + return srq->context->ops.query_srq(srq, srq_attr); +} + int ibv_destroy_srq(struct ibv_srq *srq) { return srq->context->ops.destroy_srq(srq); Index: last_stable/src/userspace/libmthca/src/mthca.c =================================================================== --- last_stable.orig/src/userspace/libmthca/src/mthca.c 2006-02-07 17:00:23.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/mthca.c 2006-02-08 09:32:01.000000000 +0200 @@ -110,6 +111,7 @@ static struct ibv_context_ops mthca_ctx_ .destroy_cq = mthca_destroy_cq, .create_srq = mthca_create_srq, .modify_srq = mthca_modify_srq, + .query_srq = mthca_query_srq, .destroy_srq = mthca_destroy_srq, .create_qp = mthca_create_qp, .modify_qp = mthca_modify_qp, Index: last_stable/src/userspace/libmthca/src/mthca.h =================================================================== --- last_stable.orig/src/userspace/libmthca/src/mthca.h 2006-02-07 17:00:23.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/mthca.h 2006-02-08 09:46:20.000000000 +0200 @@ -297,6 +298,8 @@ struct ibv_srq *mthca_create_srq(struct int mthca_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *attr, enum ibv_srq_attr_mask mask); +int mthca_query_srq(struct ibv_srq *srq, + 
struct ibv_srq_attr *attr); int mthca_destroy_srq(struct ibv_srq *srq); int mthca_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, struct mthca_srq *srq); Index: last_stable/src/userspace/libmthca/src/verbs.c =================================================================== --- last_stable.orig/src/userspace/libmthca/src/verbs.c 2006-02-07 17:00:23.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/verbs.c 2006-02-08 09:33:48.000000000 +0200 @@ -439,6 +440,14 @@ int mthca_modify_srq(struct ibv_srq *srq return ibv_cmd_modify_srq(srq, attr, attr_mask, &cmd, sizeof cmd); } +int mthca_query_srq(struct ibv_srq *srq, + struct ibv_srq_attr *attr) +{ + struct ibv_query_srq cmd; + + return ibv_cmd_query_srq(srq, attr, &cmd, sizeof cmd); +} + int mthca_destroy_srq(struct ibv_srq *srq) { int ret; Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] From caitlinb at broadcom.com Thu Feb 9 07:13:33 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 07:13:33 -0800 Subject: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F1025E3E@NT-SJCA-0751.brcm.ad.broadcom.com> -----Original Message----- From: openib-general-bounces at openib.org on behalf of Sean Hefty Sent: Wed 2/8/2006 11:20 PM To: 'Roland Dreier'; Arlin Davis Cc: dat-discussions at yahoogroups.com; openib-general at openib.org Subject: RE: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal >Hmm. Can you put a number on how much better RDMA write with >immediate is on current HCA hardware? How does using the underlying >OpenIB verbs ability to post a list of work requests compare (ie >posting an RDMA write followed by a send in one verbs call)?
>Maybe "post multiple" is a better direction for DAT. A "post multiple" call as a general API makes sense, but I think that's a separate issue. Given that IB provides true immediate data with RDMA writes, a way should be available to make use of it. I don't know what the performance numbers between using a write with immediate versus a write followed by a send, but I don't think that anyone could argue that the write with immediate wouldn't perform better. To me, the question is whether write with immediate is supported as a transport specific extension, which was Arlin's original patch, or through some standard API. The attempt to make the API standard, so that iWarp could emulate it (poorly in my view), is what appears to be driving the disagreements. It also appears to me that the decisions are coming down to one of the following. If iWarp can emulate write with immediate, then a generic API should be used. If iWarp cannot properly emulate write with immediate, then the API should be transport specific. It's curious to me that in both cases, iWarp is driving the API decision and design for something that is an IB specific feature. - Sean ---------------------- But why define an IB specific feature when a transport neutral feature can be defined? Viewing the operation as Write with following Send maintains transport neutral semantics AND allows IB to encode it as a Write with Immediate. That avoids IB to use the silicon that already exists to support compressing the Write and Send into a single message. That is the real benefit, isn't it? And for both transports it enables the Provider to pass the 4 byte immediate data by value rather than by registered reference. So there is a definite benefit to IB, and a potential benefit to IP, and it works for both transports. The *only* thing gained by making it a transport specific method is the implicit 33rd bit in the "that RDMA Write payload you asked for has arrived" message. 
Is there a concrete example of any benefit from encoding a 33rd bit in the selection of Write with Immediate versus Write followed by 32-bit Send? From rdreier at cisco.com Thu Feb 9 07:32:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 07:32:31 -0800 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> (Michael Di Domenico's message of "Thu, 9 Feb 2006 08:53:31 -0500") References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> Message-ID: Michael> My problems if you recall with the mellanox cards seem to Michael> be related to Linux Kernel 2.6.15.3... Installed RHEL4 Michael> instead of Fedora Core and get the same results using Michael> kernel 2.6.15.3 and RHEL4... No, I think the problem is that something about your system is causing problems when the driver resets the HCA. Presumably the silverstorm stack does not reset the HCA. I had one thought: do you know the speed (100 MHz, 133 MHz, ...) of the PCI slot that your HCA is in? In any case you should be able to comment out the call to mthca_reset() in mthca_init_one() as a workaround for now. - R. From halr at voltaire.com Thu Feb 9 07:22:37 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 10:22:37 -0500 Subject: [openib-general] Re: [PATCH] Opensm - asserts before OSM_LOG_ENTER - cont. In-Reply-To: <5zr76g2829.fsf@mtl066.yok.mtl.com> References: <5zr76g2829.fsf@mtl066.yok.mtl.com> Message-ID: <1139498556.4450.627.camel@hal.voltaire.com> Hi Yael, On Mon, 2006-02-06 at 03:41, Yael Kalka wrote: > Hi Hal, > > The Patch Michael Tsirkin suggested for fixing the OSM_LOG_ENTER > problem works fine both for windows and for linux. > Here is the patch for this, instead of the previous one I sent. Thanks. Applied. 
-- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: include/opensm/osm_log.h > =================================================================== > --- include/opensm/osm_log.h (revision 5307) > +++ include/opensm/osm_log.h (working copy) > @@ -71,17 +71,15 @@ BEGIN_C_DECLS > #define LOG_ENTRY_SIZE_MAX 4096 > #define BUF_SIZE LOG_ENTRY_SIZE_MAX > > -#define OSM_LOG_DEFINE_FUNC( NAME ) \ > - static const char osm_log_func_name[] = #NAME > +#define __func__ __FUNCTION__ > > #define OSM_LOG_ENTER( OSM_LOG_PTR, NAME ) \ > - OSM_LOG_DEFINE_FUNC( NAME ); \ > osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \ > - "%s: [\n", osm_log_func_name ); > + "%s: [\n", __func__ ); > > #define OSM_LOG_EXIT( OSM_LOG_PTR ) \ > osm_log( OSM_LOG_PTR, OSM_LOG_FUNCS, \ > - "%s: ]\n", osm_log_func_name ); > + "%s: ]\n", __func__ ); > > /****h* OpenSM/Log > * NAME > From rdreier at cisco.com Thu Feb 9 07:36:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 07:36:18 -0800 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: (Sean Hefty's message of "Wed, 8 Feb 2006 23:45:22 -0800") References: Message-ID: Jack> I'm concerned that an allocation of a 4K buffer may fail in Jack> a situation where lots of small allocations of around 256 Jack> bytes would succeed. Is your point that if we fail to Jack> allocate a 4K buffer, we're in deep trouble already? Note Jack> that I've only considered a 1000 host cluster. Sean> Yes - if we can't allocate a 4k buffer, it seems highly Sean> unlikely that we'd be able to allocate 1000 256-byte Sean> buffers. My rule of thumb is that we shouldn't rely on being able to allocate a contiguous buffer bigger than 4 KB, but assuming we can allocate 4 KB is fine. 4 KB is the lowest page size of any real architecture, and if the kernel is out of free pages then any allocation is likely to fail. Allocations of larger buffers may fail because of memory fragmentation, even with plenty of free memory. 
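The rule of thumb above (no single contiguous allocation larger than 4 KB) can be sketched as a list of fixed-size segments. This is only a userspace illustration: malloc() stands in for the kernel allocator, and SEG_SIZE, struct seg, seg_list_alloc() and seg_list_free() are hypothetical names, not taken from the RMPP patch under discussion.

```c
#include <stdlib.h>
#include <stddef.h>

#define SEG_SIZE 4096	/* assumed segment size: one 4 KB page */

struct seg {
	struct seg *next;
	size_t len;		/* bytes available in data[] */
	unsigned char *data;	/* separate allocation, at most SEG_SIZE bytes */
};

/* Free every segment in the list. */
static void seg_list_free(struct seg *head)
{
	while (head) {
		struct seg *next = head->next;

		free(head->data);
		free(head);
		head = next;
	}
}

/*
 * Allocate enough segments to hold 'total' bytes.
 * Returns NULL on allocation failure or when total is 0.
 * No single allocation is larger than SEG_SIZE bytes.
 */
static struct seg *seg_list_alloc(size_t total)
{
	struct seg *head = NULL, **tail = &head;

	while (total > 0) {
		size_t len = total < SEG_SIZE ? total : SEG_SIZE;
		struct seg *s = malloc(sizeof *s);

		if (!s)
			goto fail;
		s->next = NULL;
		s->len = len;
		s->data = malloc(len);
		if (!s->data) {
			free(s);
			goto fail;
		}
		*tail = s;
		tail = &s->next;
		total -= len;
	}
	return head;

fail:
	seg_list_free(head);
	return NULL;
}
```

With this layout a 10000-byte payload ends up in three segments (4096 + 4096 + 1808 bytes), so the allocator never has to find more than one free page at a time.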
That is: a 4 KB buffer is fine. - R. From halr at voltaire.com Thu Feb 9 07:28:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 10:28:32 -0500 Subject: [openib-general] Re: [PATCH] Opensm - cl_event_wheel casting In-Reply-To: <5zpsm027hf.fsf@mtl066.yok.mtl.com> References: <5zpsm027hf.fsf@mtl066.yok.mtl.com> Message-ID: <1139498911.4450.642.camel@hal.voltaire.com> Hi Yael, On Mon, 2006-02-06 at 03:53, Yael Kalka wrote: > Hi Hal, > > The following patch adds the casting done in a clearer way - to avoid > compilation errors in windows. Also - added a clear message if the > timeout was trimmed (due to the casting). > > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: complib/cl_event_wheel.c > =================================================================== > --- complib/cl_event_wheel.c (revision 5307) > +++ complib/cl_event_wheel.c (working copy) > @@ -426,8 +426,18 @@ cl_event_wheel_reg( > * cl_timer_stop(&p_event_wheel->timer); > */ > > + /* The timeout for the cl_timer_start should be given as uint32_t. > + if there is an overflow - warn about it. */ > + if ( timeout > (uint32_t)timeout ) > + { > + osm_log (p_event_wheel->p_log, OSM_LOG_INFO, > + "cl_event_wheel_reg: " > + "timeout requested is too large. Using timeout: %u \n", > + (uint32_t)timeout ); > + } > + > /* start the timer to the timeout [msec] */ > - cl_status = cl_timer_start(&p_event_wheel->timer, timeout); > + cl_status = cl_timer_start(&p_event_wheel->timer, (uint32_t)timeout); Shouldn't this use the max 32 bit timeout here rather than the low 32 bits ? 
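One way to read the suggestion above is a saturating conversion: if the requested timeout does not fit in 32 bits, use the largest 32-bit value instead of the low 32 bits. The sketch below assumes the timeout is a 64-bit millisecond count (as the cast in the patch suggests); clamp_timeout_ms() is a hypothetical helper, not part of complib.

```c
#include <stdint.h>

/*
 * Saturating conversion of a 64-bit timeout (msec) to 32 bits.
 * Sets *trimmed to 1 if the value had to be clamped, 0 otherwise.
 * The caller must pass a valid pointer for 'trimmed'.
 */
static uint32_t clamp_timeout_ms(uint64_t timeout_ms, int *trimmed)
{
	if (timeout_ms > UINT32_MAX) {
		*trimmed = 1;
		return UINT32_MAX;	/* longest timeout the timer can take */
	}
	*trimmed = 0;
	return (uint32_t)timeout_ms;
}
```

The difference matters at the boundary: a plain (uint32_t) cast of 2^32 yields 0, so a very long timeout would turn into an immediate one, whereas the clamped value stays as long as the timer allows.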
-- Hal > if (cl_status != CL_SUCCESS) > { > From suri at baymicrosystems.com Thu Feb 9 07:40:53 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 9 Feb 2006 10:40:53 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <20060208221510.GE32082@mellanox.co.il> Message-ID: <200602091540.k19FewKX012081@mail.baymicrosystems.com> Hi: For a switch only one port_priv object is created in mad.c. As a result, it appears like the process_mad function is always called with a port number of zero. And the return path is always filled with zero as well (in smi.c). Should not this be the physical port number from which the mad packet came? Do I have to do something with the return path attributes when I send the packet out in my switch driver? I am using linux 2.6.12. Am I not reading something right? Thanks, Suri From dotanb at mellanox.co.il Thu Feb 9 07:48:51 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Thu, 9 Feb 2006 17:48:51 +0200 Subject: [openib-general] 1/2 core kernel changes for query QP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD129@mtlexch01.mtl.com> Add support to uverbs to handle querying userspace QPs (Queue Pair), including adding an ABI for marshalling requests and responses. The kernel midlayer already has ib_query_qp(). 
Signed-off-by: Dotan Barak Index: latest/drivers/infiniband/core/uverbs.h =================================================================== --- latest.orig/drivers/infiniband/core/uverbs.h 2006-02-09 08:03:56.000000000 +0200 +++ latest/drivers/infiniband/core/uverbs.h 2006-02-09 17:16:58.000000000 +0200 @@ -191,6 +191,7 @@ IB_UVERBS_DECLARE_CMD(req_notify_cq); IB_UVERBS_DECLARE_CMD(destroy_cq); IB_UVERBS_DECLARE_CMD(create_qp); IB_UVERBS_DECLARE_CMD(modify_qp); +IB_UVERBS_DECLARE_CMD(query_qp); IB_UVERBS_DECLARE_CMD(destroy_qp); IB_UVERBS_DECLARE_CMD(post_send); IB_UVERBS_DECLARE_CMD(post_recv); Index: latest/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- latest.orig/drivers/infiniband/core/uverbs_cmd.c 2006-02-09 08:03:56.000000000 +0200 +++ latest/drivers/infiniband/core/uverbs_cmd.c 2006-02-09 17:35:51.000000000 +0200 @@ -1079,6 +1079,110 @@ out: return ret; } +ssize_t ib_uverbs_query_qp(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_query_qp cmd; + struct ib_uverbs_query_qp_resp resp; + struct ib_qp *qp; + struct ib_qp_attr *attr; + struct ib_qp_init_attr *init_attr; + int ret; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + init_attr = kmalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) { + kfree(attr); + return -ENOMEM; + } + + mutex_lock(&ib_uverbs_idr_mutex); + + qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); + if (!qp || qp->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out1; + } + + ret = ib_query_qp(qp, attr, cmd.attr_mask, init_attr); +out1: + mutex_unlock(&ib_uverbs_idr_mutex); + + if (ret) + goto out2; + + memset(&resp, 0, sizeof resp); + + resp.qp_state = attr->qp_state; + resp.cur_qp_state = attr->cur_qp_state; + resp.path_mtu = attr->path_mtu; + resp.path_mig_state = attr->path_mig_state; + resp.qkey = 
attr->qkey; + resp.rq_psn = attr->rq_psn; + resp.sq_psn = attr->sq_psn; + resp.dest_qp_num = attr->dest_qp_num; + resp.qp_access_flags = attr->qp_access_flags; + resp.pkey_index = attr->pkey_index; + resp.alt_pkey_index = attr->alt_pkey_index; + resp.en_sqd_async_notify = attr->en_sqd_async_notify; + resp.max_rd_atomic = attr->max_rd_atomic; + resp.max_dest_rd_atomic = attr->max_dest_rd_atomic; + resp.min_rnr_timer = attr->min_rnr_timer; + resp.port_num = attr->port_num; + resp.timeout = attr->timeout; + resp.retry_cnt = attr->retry_cnt; + resp.rnr_retry = attr->rnr_retry; + resp.alt_port_num = attr->alt_port_num; + resp.alt_timeout = attr->alt_timeout; + + memcpy(resp.dest.dgid, attr->ah_attr.grh.dgid.raw, 16); + resp.dest.flow_label = attr->ah_attr.grh.flow_label; + resp.dest.sgid_index = attr->ah_attr.grh.sgid_index; + resp.dest.hop_limit = attr->ah_attr.grh.hop_limit; + resp.dest.traffic_class = attr->ah_attr.grh.traffic_class; + resp.dest.dlid = attr->ah_attr.dlid; + resp.dest.sl = attr->ah_attr.sl; + resp.dest.src_path_bits = attr->ah_attr.src_path_bits; + resp.dest.static_rate = attr->ah_attr.static_rate; + resp.dest.is_global = (attr->ah_attr.ah_flags & IB_AH_GRH); + resp.dest.port_num = attr->ah_attr.port_num; + + memcpy(resp.alt_dest.dgid, attr->alt_ah_attr.grh.dgid.raw, 16); + resp.alt_dest.flow_label = attr->alt_ah_attr.grh.flow_label; + resp.alt_dest.sgid_index = attr->alt_ah_attr.grh.sgid_index; + resp.alt_dest.hop_limit = attr->alt_ah_attr.grh.hop_limit; + resp.alt_dest.traffic_class = attr->alt_ah_attr.grh.traffic_class; + resp.alt_dest.dlid = attr->alt_ah_attr.dlid; + resp.alt_dest.sl = attr->alt_ah_attr.sl; + resp.alt_dest.src_path_bits = attr->alt_ah_attr.src_path_bits; + resp.alt_dest.static_rate = attr->alt_ah_attr.static_rate; + resp.alt_dest.is_global = !!(attr->alt_ah_attr.ah_flags & IB_AH_GRH); + resp.alt_dest.port_num = attr->alt_ah_attr.port_num; + + resp.max_send_wr = init_attr->cap.max_send_wr; + resp.max_recv_wr = 
init_attr->cap.max_recv_wr; + resp.max_send_sge = init_attr->cap.max_send_sge; + resp.max_recv_sge = init_attr->cap.max_recv_sge; + resp.max_inline_data = init_attr->cap.max_inline_data; + resp.sq_sig_all = !!init_attr->sq_sig_type; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; +out2: + kfree(attr); + kfree(init_attr); + + return ret ? ret : in_len; +} + ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) Index: latest/drivers/infiniband/core/uverbs_main.c =================================================================== --- latest.orig/drivers/infiniband/core/uverbs_main.c 2006-02-09 17:05:56.000000000 +0200 +++ latest/drivers/infiniband/core/uverbs_main.c 2006-02-09 17:17:29.000000000 +0200 @@ -97,6 +97,7 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq, [IB_USER_VERBS_CMD_CREATE_QP] = ib_uverbs_create_qp, [IB_USER_VERBS_CMD_MODIFY_QP] = ib_uverbs_modify_qp, + [IB_USER_VERBS_CMD_QUERY_QP] = ib_uverbs_query_qp, [IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp, [IB_USER_VERBS_CMD_POST_SEND] = ib_uverbs_post_send, [IB_USER_VERBS_CMD_POST_RECV] = ib_uverbs_post_recv, Index: latest/drivers/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- latest.orig/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-09 08:03:53.000000000 +0200 +++ latest/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-09 17:21:02.000000000 +0200 @@ -461,6 +461,47 @@ struct ib_uverbs_modify_qp { struct ib_uverbs_modify_qp_resp { }; +struct ib_uverbs_query_qp { + __u64 response; + __u32 qp_handle; + __u32 attr_mask; + __u64 driver_data[0]; +}; + +struct ib_uverbs_query_qp_resp { + struct ib_uverbs_qp_dest dest; + struct ib_uverbs_qp_dest alt_dest; + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + 
__u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 qp_state; + __u8 cur_qp_state; + __u8 path_mtu; + __u8 path_mig_state; + __u8 en_sqd_async_notify; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 sq_sig_all; + __u8 reserved[5]; + __u64 driver_data[0]; +}; + struct ib_uverbs_destroy_qp { __u64 response; __u32 qp_handle;

Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ]

From dotanb at mellanox.co.il Thu Feb 9 07:50:55 2006
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Thu, 9 Feb 2006 17:50:55 +0200
Subject: [openib-general] 2/2 libibverbs + libmthca changes for query QP
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD12E@mtlexch01.mtl.com>

Note: this implementation doesn't know how to handle the masks.

libibverbs and libmthca changes to handle querying QPs. Essentially just adding API and support for passing the call through to provider plug-ins.
Signed-off-by: Dotan Barak Index: openib_gen2/src/userspace/libibverbs/include/infiniband/driver.h =================================================================== --- openib_gen2.orig/src/userspace/libibverbs/include/infiniband/driver.h 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/include/infiniband/driver.h 2006-02-08 15:54:22.000000000 +0200 @@ -115,6 +115,9 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, int ibv_cmd_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask, struct ibv_modify_qp *cmd, size_t cmd_size); +int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, + int qp_attr_mask, struct ibv_qp_init_attr *qp_init_attr, + struct ibv_query_qp *cmd, size_t cmd_size); int ibv_cmd_destroy_qp(struct ibv_qp *qp); int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); Index: openib_gen2/src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- openib_gen2.orig/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-08 15:54:22.000000000 +0200 @@ -509,6 +509,50 @@ struct ibv_modify_qp { __u64 driver_data[0]; }; +struct ibv_query_qp { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 qp_handle; + __u32 attr_mask; + __u64 driver_data[0]; +}; + +struct ibv_query_qp_resp { + struct ibv_qp_dest dest; + struct ibv_qp_dest alt_dest; + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 qp_state; + __u8 cur_qp_state; + __u8 path_mtu; + __u8 path_mig_state; + __u8 en_sqd_async_notify; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 
min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 sq_sig_all; + __u8 reserved[5]; + __u64 driver_data[0]; +}; + struct ibv_destroy_qp { __u32 command; __u16 in_words; Index: openib_gen2/src/userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- openib_gen2.orig/src/userspace/libibverbs/include/infiniband/verbs.h 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/include/infiniband/verbs.h 2006-02-08 15:54:22.000000000 +0200 @@ -563,6 +563,10 @@ struct ibv_context_ops { struct ibv_qp * (*create_qp)(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int (*modify_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); + int (*query_qp)(struct ibv_qp *qp, + struct ibv_qp_attr *qp_attr, + int qp_attr_mask, + struct ibv_qp_init_attr *qp_init_attr); int (*destroy_qp)(struct ibv_qp *qp); int (*post_send)(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); @@ -847,6 +851,20 @@ int ibv_modify_qp(struct ibv_qp *qp, str enum ibv_qp_attr_mask attr_mask); /** + * ibv_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. + * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, + int qp_attr_mask, struct ibv_qp_init_attr *qp_init_attr); + +/** * ibv_destroy_qp - Destroy a queue pair. 
*/ int ibv_destroy_qp(struct ibv_qp *qp); Index: openib_gen2/src/userspace/libibverbs/src/cmd.c =================================================================== --- openib_gen2.orig/src/userspace/libibverbs/src/cmd.c 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/src/cmd.c 2006-02-08 15:58:21.000000000 +0200 @@ -629,6 +629,86 @@ int ibv_cmd_modify_qp(struct ibv_qp *qp, return 0; } +int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, + int qp_attr_mask, + struct ibv_qp_init_attr *qp_init_attr, + struct ibv_query_qp *cmd, size_t cmd_size) +{ + struct ibv_query_qp_resp resp; + + IBV_INIT_CMD_RESP(cmd, cmd_size, QUERY_QP, &resp, sizeof resp); + cmd->qp_handle = qp->handle; + cmd->attr_mask = qp_attr_mask; + + if (write(qp->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + qp_attr->qkey = resp.qkey; + qp_attr->rq_psn = resp.rq_psn; + qp_attr->sq_psn = resp.sq_psn; + qp_attr->dest_qp_num = resp.dest_qp_num; + qp_attr->qp_access_flags = resp.qp_access_flags; + qp_attr->pkey_index = resp.pkey_index; + qp_attr->alt_pkey_index = resp.alt_pkey_index; + qp_attr->qp_state = resp.qp_state; + qp_attr->cur_qp_state = resp.cur_qp_state; + qp_attr->path_mtu = resp.path_mtu; + qp_attr->path_mig_state = resp.path_mig_state; + qp_attr->en_sqd_async_notify = resp.en_sqd_async_notify; + qp_attr->max_rd_atomic = resp.max_rd_atomic; + qp_attr->max_dest_rd_atomic = resp.max_dest_rd_atomic; + qp_attr->min_rnr_timer = resp.min_rnr_timer; + qp_attr->port_num = resp.port_num; + qp_attr->timeout = resp.timeout; + qp_attr->retry_cnt = resp.retry_cnt; + qp_attr->rnr_retry = resp.rnr_retry; + qp_attr->alt_port_num = resp.alt_port_num; + qp_attr->alt_timeout = resp.alt_timeout; + qp_attr->cap.max_send_wr = resp.max_send_wr; + qp_attr->cap.max_recv_wr = resp.max_recv_wr; + qp_attr->cap.max_send_sge = resp.max_send_sge; + qp_attr->cap.max_recv_sge = resp.max_recv_sge; + qp_attr->cap.max_inline_data = resp.max_inline_data; + + 
memcpy(qp_attr->ah_attr.grh.dgid.raw, resp.dest.dgid, 16); + qp_attr->ah_attr.grh.flow_label = resp.dest.flow_label; + qp_attr->ah_attr.dlid = resp.dest.dlid; + qp_attr->ah_attr.grh.sgid_index = resp.dest.sgid_index; + qp_attr->ah_attr.grh.hop_limit = resp.dest.hop_limit; + qp_attr->ah_attr.grh.traffic_class = resp.dest.traffic_class; + qp_attr->ah_attr.sl = resp.dest.sl; + qp_attr->ah_attr.src_path_bits = resp.dest.src_path_bits; + qp_attr->ah_attr.static_rate = resp.dest.static_rate; + qp_attr->ah_attr.is_global = resp.dest.is_global; + qp_attr->ah_attr.port_num = resp.dest.port_num; + + memcpy(qp_attr->alt_ah_attr.grh.dgid.raw, resp.alt_dest.dgid, 16); + qp_attr->alt_ah_attr.grh.flow_label = resp.alt_dest.flow_label; + qp_attr->alt_ah_attr.dlid = resp.alt_dest.dlid; + qp_attr->alt_ah_attr.grh.sgid_index = resp.alt_dest.sgid_index; + qp_attr->alt_ah_attr.grh.hop_limit = resp.alt_dest.hop_limit; + qp_attr->alt_ah_attr.grh.traffic_class = resp.alt_dest.traffic_class; + qp_attr->alt_ah_attr.sl = resp.alt_dest.sl; + qp_attr->alt_ah_attr.src_path_bits = resp.alt_dest.src_path_bits; + qp_attr->alt_ah_attr.static_rate = resp.alt_dest.static_rate; + qp_attr->alt_ah_attr.is_global = resp.alt_dest.is_global; + qp_attr->alt_ah_attr.port_num = resp.alt_dest.port_num; + + qp_init_attr->qp_context = qp->qp_context; + qp_init_attr->send_cq = qp->send_cq; + qp_init_attr->recv_cq = qp->recv_cq; + qp_init_attr->srq = qp->srq; + qp_init_attr->qp_type = qp->qp_type; + qp_init_attr->cap.max_send_wr = resp.max_send_wr; + qp_init_attr->cap.max_recv_wr = resp.max_recv_wr; + qp_init_attr->cap.max_send_sge = resp.max_send_sge; + qp_init_attr->cap.max_recv_sge = resp.max_recv_sge; + qp_init_attr->cap.max_inline_data = resp.max_inline_data; + qp_init_attr->sq_sig_all = resp.sq_sig_all; + + return 0; +} + static int ibv_cmd_destroy_qp_v1(struct ibv_qp *qp) { struct ibv_destroy_qp_v1 cmd; Index: openib_gen2/src/userspace/libibverbs/src/libibverbs.map 
=================================================================== --- openib_gen2.orig/src/userspace/libibverbs/src/libibverbs.map 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/src/libibverbs.map 2006-02-08 15:54:22.000000000 +0200 @@ -28,6 +28,7 @@ IBVERBS_1.0 { ibv_destroy_srq; ibv_create_qp; ibv_modify_qp; + ibv_query_qp; ibv_destroy_qp; ibv_create_ah; ibv_destroy_ah; @@ -52,6 +53,7 @@ IBVERBS_1.0 { ibv_cmd_destroy_srq; ibv_cmd_create_qp; ibv_cmd_modify_qp; + ibv_cmd_query_qp; ibv_cmd_destroy_qp; ibv_cmd_post_send; ibv_cmd_post_recv; Index: openib_gen2/src/userspace/libibverbs/src/verbs.c =================================================================== --- openib_gen2.orig/src/userspace/libibverbs/src/verbs.c 2006-02-07 17:00:24.000000000 +0200 +++ openib_gen2/src/userspace/libibverbs/src/verbs.c 2006-02-08 17:55:12.000000000 +0200 @@ -321,6 +321,21 @@ int ibv_modify_qp(struct ibv_qp *qp, str return 0; } +int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, + int qp_attr_mask, struct ibv_qp_init_attr *qp_init_attr) +{ + int ret; + + ret = qp->context->ops.query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr); + if (ret) + return ret; + + if (qp_attr_mask & IBV_QP_STATE) + qp->state = qp_attr->qp_state; + + return 0; +} + int ibv_destroy_qp(struct ibv_qp *qp) { return qp->context->ops.destroy_qp(qp); Index: openib_gen2/src/userspace/libmthca/src/mthca.c =================================================================== --- openib_gen2.orig/src/userspace/libmthca/src/mthca.c 2006-02-07 17:00:23.000000000 +0200 +++ openib_gen2/src/userspace/libmthca/src/mthca.c 2006-02-08 15:54:22.000000000 +0200 @@ -113,6 +113,7 @@ static struct ibv_context_ops mthca_ctx_ .destroy_srq = mthca_destroy_srq, .create_qp = mthca_create_qp, .modify_qp = mthca_modify_qp, + .query_qp = mthca_query_qp, .destroy_qp = mthca_destroy_qp, .create_ah = mthca_create_ah, .destroy_ah = mthca_destroy_ah, Index: 
openib_gen2/src/userspace/libmthca/src/mthca.h =================================================================== --- openib_gen2.orig/src/userspace/libmthca/src/mthca.h 2006-02-07 17:00:23.000000000 +0200 +++ openib_gen2/src/userspace/libmthca/src/mthca.h 2006-02-08 15:54:22.000000000 +0200 @@ -311,6 +311,8 @@ int mthca_arbel_post_srq_recv(struct ibv struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int mthca_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); +int mthca_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + int qp_attr_mask, struct ibv_qp_init_attr *qp_init_attr); int mthca_destroy_qp(struct ibv_qp *qp); void mthca_init_qp_indices(struct mthca_qp *qp); int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, Index: openib_gen2/src/userspace/libmthca/src/verbs.c =================================================================== --- openib_gen2.orig/src/userspace/libmthca/src/verbs.c 2006-02-07 17:00:23.000000000 +0200 +++ openib_gen2/src/userspace/libmthca/src/verbs.c 2006-02-08 15:54:22.000000000 +0200 @@ -591,6 +591,14 @@ int mthca_modify_qp(struct ibv_qp *qp, s return ret; } +int mthca_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, + int qp_attr_mask, struct ibv_qp_init_attr *qp_init_attr) +{ + struct ibv_query_qp cmd; + + return ibv_cmd_query_qp(qp, attr, qp_attr_mask, qp_init_attr, &cmd, sizeof cmd); +} + int mthca_destroy_qp(struct ibv_qp *qp) { int ret; Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From rdreier at cisco.com Thu Feb 9 07:55:33 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 09 Feb 2006 07:55:33 -0800
Subject: [openib-general] port num in port priv
In-Reply-To: <200602091540.k19FewKX012081@mail.baymicrosystems.com> (Suresh Shelvapille's message of "Thu, 9 Feb 2006 10:40:53 -0500")
References: <200602091540.k19FewKX012081@mail.baymicrosystems.com>
Message-ID:

Suresh> For a switch only one port_priv object is created in
Suresh> mad.c. As a result, it appears like the process_mad
Suresh> function is always called with a port number of zero. And
Suresh> the return path is always filled with zero as well (in
Suresh> smi.c). Should not this be the physical port number from
Suresh> which the mad packet came?

You are breaking new ground by running the Linux IB stack on top of a switch device. The current code probably needs to be extended to give the upper layers the port number where a MAD was received, and also to make it possible to specify the port number that a directed route MAD will be sent on. Also I'm not sure whether the directed route handling will work for MADs whose final destination is not the local switch and which therefore need to be forwarded instead of processed. So there are quite a few things in the core that you will need to add for your device.

Suresh> Do I have to do something with the return path attributes
Suresh> when I send the packet out in my switch driver? I am using
Suresh> linux 2.6.12.

I would suggest using a newer kernel. Otherwise you will be struggling to merge your change back and forth.

- R.
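To make the port-number question above concrete, here is a rough model of what an intermediate switch does with an outgoing directed-route SMP, as I understand the directed-route rules in the IB spec: it records the ingress port in the return path and forwards on the next initial-path entry. This is a simplified sketch, not the code in smi.c; struct dr_smp and dr_switch_forward() are hypothetical names, and only the fields needed for the illustration are modelled.

```c
#include <stdint.h>

#define DR_PATH_LEN 64	/* directed-route path arrays hold 64 one-byte entries */

/* Simplified stand-in for the directed-route fields of an SMP. */
struct dr_smp {
	uint8_t hop_ptr;
	uint8_t hop_cnt;
	uint8_t initial_path[DR_PATH_LEN];
	uint8_t return_path[DR_PATH_LEN];
};

/*
 * Handle an outgoing directed-route SMP that arrived at an
 * intermediate switch on 'ingress_port'.
 *
 * On success, records the ingress port in the return path, advances
 * hop_ptr and returns the egress port number.  Returns -1 (leaving
 * the SMP untouched) if this is not an intermediate hop or the next
 * initial-path entry is not a valid port on this switch.
 */
static int dr_switch_forward(struct dr_smp *smp, uint8_t ingress_port,
			     uint8_t num_ports)
{
	uint8_t egress;

	if (smp->hop_cnt >= DR_PATH_LEN)
		return -1;
	/* Intermediate hop means 1 <= hop_ptr < hop_cnt. */
	if (smp->hop_ptr == 0 || smp->hop_ptr >= smp->hop_cnt)
		return -1;

	egress = smp->initial_path[smp->hop_ptr + 1];
	if (egress > num_ports)
		return -1;

	/* The response retraces the path using these entries. */
	smp->return_path[smp->hop_ptr] = ingress_port;
	smp->hop_ptr++;
	return egress;
}
```

If the stack always reports port 0 as the receiving port, every return_path entry written this way would be 0 and the response could not find its way back, which is why the real ingress port has to reach this layer.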
From halr at voltaire.com Thu Feb 9 07:47:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 10:47:33 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osm_db_file.c - windows fixes In-Reply-To: <5zoe1k271t.fsf@mtl066.yok.mtl.com> References: <5zoe1k271t.fsf@mtl066.yok.mtl.com> Message-ID: <1139500052.4450.686.camel@hal.voltaire.com> On Mon, 2006-02-06 at 04:03, Yael Kalka wrote: > Hi Hal, > > The following patch adds some changes in osm_db_file.c to match the > windows stack. Thanks. Applied with some commentary difference and error number change. -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_db_files.c > =================================================================== > --- opensm/osm_db_files.c (revision 5307) > +++ opensm/osm_db_files.c (working copy) > @@ -172,6 +172,12 @@ osm_db_init( > if ( p_db_imp->db_dir_name == NULL ) > p_db_imp->db_dir_name = OSM_DEFAULT_CACHE_DIR; > > + /* create the directory if it doesn't exist */ > + /* There is difference between creating in windows and in linux */ > +#ifdef __WIN__ > + /* Check if the directory exists. If not - create it. 
*/ > + CreateDirectory(p_db_imp->db_dir_name, NULL); > +#else /* __WIN__ */ > /* make sure the directory exists */ > if (lstat(p_db_imp->db_dir_name, &dstat)) > { > @@ -185,6 +191,7 @@ osm_db_init( > return 1; > } > } > +#endif > > p_db->p_log = p_log; > p_db->p_db_imp = (void*)p_db_imp; > @@ -466,6 +473,14 @@ osm_db_store( > fclose(p_file); > > /* move the domain file */ > + status = remove(p_domain_imp->file_name); > + if (status) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "osm_db_store: ERR 6909: " > + " Fail to remove file:%s (err:%u)\n", > + p_domain_imp->file_name, status); > + } > status = rename(p_tmp_file_name, p_domain_imp->file_name); > if (status) > { > From grave at ipno.in2p3.fr Thu Feb 9 08:31:10 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Thu, 09 Feb 2006 17:31:10 +0100 Subject: [openib-general] libsdp running nearly fine Message-ID: <1139502670.14833.14.camel@ipnnarval> Hi all, I have setup libsdp and it works quite well except if I try to send buffer with a size > 5100 bytes I get this kind of kernel messages : Unable to handle kernel paging request for data at address 0xd000080080085cc0 Faulting instruction address: 0xd0000000001dd3b4 Oops: Kernel access of bad area, sig: 7 [#1] SMP NR_CPUS=32 NUMA PSERIES LPAR Modules linked in: ipv6 nfsd exportfs nfs_acl lockd sunrpc ib_uverbs psmouse idv NIP: D0000000001DD3B4 LR: C00000000022C808 CTR: D0000000001DD25C REGS: c0000000b27d7330 TRAP: 0300 Not tainted (2.6.16-rc2) MSR: 8000000000009032 CR: 24000488 XER: 00000010 DAR: D000080080085CC0, DSISR: 0000000042000000 TASK = c000000004a12040[2760] 'client' THREAD: c0000000b27d4000 CPU: 3 GPR00: 0000000000020000 C0000000B27D75B0 D0000000002014A8 C0000000042A0B00 GPR04: C0000000049B3BA0 0000000000000002 0000000000481000 C0000000004CE588 GPR08: 0000000000020033 0000000000020033 D000080080085CC0 00000000000000F0 GPR12: 0000200000000000 C0000000003BC100 00000000100D0000 0000000000000000 GPR16: 0000000000000000 0000000010197EA8 0000000000000001 
C0000000B27D7C98 GPR20: 0000000000000000 C0000000B27D7B08 C0000000B1AD1E60 C0000000049B3BA0 GPR24: 0000000000000002 C00000000474FC80 C000000007323A20 8000000000009032 GPR28: C00000000474FC98 C000000007323A00 C000000000407420 C000000007323A10 NIP [D0000000001DD3B4] .mthca_tavor_map_phys_fmr+0x158/0x190 [ib_mthca] LR [C00000000022C808] .ib_fmr_pool_map_phys +0x2a4/0x4a8 Call Trace: [C0000000B27D75B0] [C00000000022C5C4] .ib_fmr_pool_map_phys+0x60/0x4a8 (unrelia) [C0000000B27D7670] [C00000000024E7B0] .sdp_iocb_register +0x5c/0x11c [C0000000B27D7700] [C000000000253A8C] .sdp_send_data_queue_test +0x624/0xd7c [C0000000B27D7820] [C000000000254220] .sdp_send_data_queue +0x3c/0xb0 [C0000000B27D78C0] [C000000000255078] .sdp_inet_send +0x5d8/0xc9c [C0000000B27D7A10] [C00000000025BEEC] .sock_sendmsg +0x114/0x15c [C0000000B27D7C10] [C00000000025CACC] .sys_sendto +0xd0/0x110 [C0000000B27D7D90] [C00000000027BE9C] .compat_sys_socketcall +0x148/0x214 [C0000000B27D7E30] [C0000000000086F8] syscall_exit +0x0/0x40 Instruction dump: 81230020 e9430048 396000f0 90030040 60000000 60000000 60000000 e8080636 7d290214 79280020 91030024 91030020 <996a0000> 7c0004ac 2f850000 78a90020 kernel is compiled with infiniband svn drivers, power5 based server. I allocate the memory with memalign and getpagesize, I compiled the ib_sdp with zero copy buffer on. Do I need to remove these options ? xavier From ralphc at pathscale.com Thu Feb 9 09:09:19 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 09 Feb 2006 09:09:19 -0800 Subject: [openib-general] libsdp running nearly fine In-Reply-To: <1139502670.14833.14.camel@ipnnarval> References: <1139502670.14833.14.camel@ipnnarval> Message-ID: <1139504959.673.31.camel@brick.internal.keyresearch.com> My guess is there is a bug in the zero-copy code. Try "echo 1000000 > /sys/module/ib_sdp/parameters/sdp_zcopy_thrsh_src_default" and see if the problem still exists. This raises the zero-copy threshold. 
On Thu, 2006-02-09 at 17:31 +0100, Xavier Grave wrote: > Hi all, > > I have setup libsdp and it works quite well except if I try to send > buffer with a size > 5100 bytes I get this kind of kernel messages : > Unable to handle kernel paging request for data at address > 0xd000080080085cc0 > Faulting instruction address: > 0xd0000000001dd3b4 > Oops: Kernel access of bad area, sig: 7 > [#1] > SMP NR_CPUS=32 NUMA PSERIES > LPAR > Modules linked in: ipv6 nfsd exportfs nfs_acl lockd sunrpc ib_uverbs > psmouse idv > NIP: D0000000001DD3B4 LR: C00000000022C808 CTR: > D0000000001DD25C > REGS: c0000000b27d7330 TRAP: 0300 Not tainted > (2.6.16-rc2) > MSR: 8000000000009032 CR: 24000488 XER: > 00000010 > DAR: D000080080085CC0, DSISR: > 0000000042000000 > TASK = c000000004a12040[2760] 'client' THREAD: c0000000b27d4000 CPU: > 3 > GPR00: 0000000000020000 C0000000B27D75B0 D0000000002014A8 > C0000000042A0B00 > GPR04: C0000000049B3BA0 0000000000000002 0000000000481000 > C0000000004CE588 > GPR08: 0000000000020033 0000000000020033 D000080080085CC0 > 00000000000000F0 > GPR12: 0000200000000000 C0000000003BC100 00000000100D0000 > 0000000000000000 > GPR16: 0000000000000000 0000000010197EA8 0000000000000001 > C0000000B27D7C98 > GPR20: 0000000000000000 C0000000B27D7B08 C0000000B1AD1E60 > C0000000049B3BA0 > GPR24: 0000000000000002 C00000000474FC80 C000000007323A20 > 8000000000009032 > GPR28: C00000000474FC98 C000000007323A00 C000000000407420 > C000000007323A10 > NIP [D0000000001DD3B4] .mthca_tavor_map_phys_fmr+0x158/0x190 > [ib_mthca] > LR [C00000000022C808] .ib_fmr_pool_map_phys > +0x2a4/0x4a8 > Call > Trace: > [C0000000B27D75B0] [C00000000022C5C4] .ib_fmr_pool_map_phys+0x60/0x4a8 > (unrelia) > [C0000000B27D7670] [C00000000024E7B0] .sdp_iocb_register > +0x5c/0x11c > [C0000000B27D7700] [C000000000253A8C] .sdp_send_data_queue_test > +0x624/0xd7c > [C0000000B27D7820] [C000000000254220] .sdp_send_data_queue > +0x3c/0xb0 > [C0000000B27D78C0] [C000000000255078] .sdp_inet_send > +0x5d8/0xc9c > 
[C0000000B27D7A10] [C00000000025BEEC] .sock_sendmsg > +0x114/0x15c > [C0000000B27D7C10] [C00000000025CACC] .sys_sendto > +0xd0/0x110 > [C0000000B27D7D90] [C00000000027BE9C] .compat_sys_socketcall > +0x148/0x214 > [C0000000B27D7E30] [C0000000000086F8] syscall_exit > +0x0/0x40 > Instruction > dump: > 81230020 e9430048 396000f0 90030040 60000000 60000000 60000000 > e8080636 > 7d290214 79280020 91030024 91030020 <996a0000> 7c0004ac 2f850000 > 78a90020 > > kernel is compiled with infiniband svn drivers, power5 based server. > I allocate the memory with memalign and getpagesize, I compiled the > ib_sdp with zero copy buffer on. > Do I need to remove these options ? > > xavier > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Ralph Campbell From rdreier at cisco.com Thu Feb 9 09:10:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 09:10:49 -0800 Subject: [openib-general] libsdp running nearly fine In-Reply-To: <1139502670.14833.14.camel@ipnnarval> (Xavier Grave's message of "Thu, 09 Feb 2006 17:31:10 +0100") References: <1139502670.14833.14.camel@ipnnarval> Message-ID: Can you send me the output of "objdump -d" on ib_mthca.ko? I'd like to try to figure out where in mthca_tavor_map_phys_fmr() the oops is happening, and so I'd like to see the compiled code. There's probably no need to spam the list with a huge assembly language listing. 
Thanks, Roland From halr at voltaire.com Thu Feb 9 09:01:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 12:01:36 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix casting for windows In-Reply-To: <5zmzh425wg.fsf@mtl066.yok.mtl.com> References: <5zmzh425wg.fsf@mtl066.yok.mtl.com> Message-ID: <1139504496.4450.904.camel@hal.voltaire.com> On Mon, 2006-02-06 at 04:27, Yael Kalka wrote: > Hi Hal, > > The following patch adds some missing casts and fixes object types to > fix compilation errors in the windows stack, aadds some changes in > osm_db_file.c to match the windows stack. Thanks. Applied. -- Hal From rdreier at cisco.com Thu Feb 9 09:29:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 09:29:26 -0800 Subject: [openib-general] Re: 2/2 libibverbs + libmthca changes for query QP In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD12E@mtlexch01.mtl.com> (Dotan Barak's message of "Thu, 9 Feb 2006 17:50:55 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD12E@mtlexch01.mtl.com> Message-ID: This reminds me of something I've been meaning to ask: what is the attr_mask of the query QP function used for? The verbs definition in the IB spec does not include an attribute mask. Obviously we don't have to follow the spec precisely but I'm wondering what the advantage is in having an attribute mask for query QP. None of the other query methods (including query SRQ) have an attribute mask. - R. From halr at voltaire.com Thu Feb 9 09:21:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 12:21:55 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <200602091540.k19FewKX012081@mail.baymicrosystems.com> References: <200602091540.k19FewKX012081@mail.baymicrosystems.com> Message-ID: <1139505714.4450.966.camel@hal.voltaire.com> Hi Suresh, On Thu, 2006-02-09 at 10:40, Suresh Shelvapille wrote: > Hi: > > For a switch Note that this is some new ground for this code. 
> only one port_priv object is created in mad.c. A switch only has one port through which all MADs are processed (switch port 0 (either base or enhanced)) so I think this is correct. > As a result, it appears like the process_mad function is always called with a port number of > zero. That should be OK. The switch external port it was received on should be in the ib_wc as follows: struct ib_wc { u64 wr_id; enum ib_wc_status status; enum ib_wc_opcode opcode; u32 vendor_err; u32 byte_len; __be32 imm_data; u32 qp_num; u32 src_qp; int wc_flags; u16 pkey_index; u16 slid; u8 sl; u8 dlid_path_bits; u8 port_num; /* valid only for DR SMPs on switches */ }; and that is the one that needs to be used in the DR return path. > And the return path is always filled with zero as well (in smi.c). For switches, the one from the WC needs to be filled in and passed so that sounds wrong and needs fixing. Do you want to take a crack at this or should I ? > Should > not this be the physical port number from which the mad packet came? Yes. > Do I have to do something with the return path attributes when I send the > packet out in my switch driver? I am using linux 2.6.12. There haven't been any changes here since then that would affect this. -- Hal > Am I not reading something right? > > Thanks, > Suri > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Feb 9 09:26:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 12:26:49 -0500 Subject: [openib-general] port num in port priv In-Reply-To: References: <200602091540.k19FewKX012081@mail.baymicrosystems.com> Message-ID: <1139505857.4450.975.camel@hal.voltaire.com> On Thu, 2006-02-09 at 10:55, Roland Dreier wrote: > Suresh> For a switch only one port_priv object is created in > Suresh> mad.c. 
As a result, it appears like the process_mad > Suresh> function is always called with a port number of zero. And > Suresh> the return path is always filled with zero as well (in > Suresh> smi.c). Should not this be the physical port number from > Suresh> which the mad packet came? > > You are breaking new ground by running the Linux IB stack on top of a > switch device. Yes, this is new ground as far as I know. > The current code probably needs to be extended to give > the upper layers the port number where a MAD was received, This is supported at the WC level. I didn't trace it all the way back to see what changes if any changes are needed at the MAD layer to pass this port number to SMI, but I think that at least is the first layer of this onion. > and also to > make it possible to specify the port number that a directed route MAD > will be sent on. > Also I'm not sure whether the directed route handling will work for > MADs whose final destination is not the local switch and which > therefore need to be forwarded instead of processed. This clearly has not been tested. -- Hal > So there are quite a few things in the core that you will need to add > for your device. > > Suresh> Do I have to do something with the return path attributes > Suresh> when I send the packet out in my switch driver? I am using > Suresh> linux 2.6.12. > > I would suggest using a newer kernel. Otherwise you will be > struggling to merge your change back and forth. > > - R. 
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Feb 9 09:38:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 09:38:48 -0800 Subject: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: References: Message-ID: <43EB7E28.3050402@ichips.intel.com> Roland Dreier wrote: > My rule of thumb is that we shouldn't rely on being able to allocate a > contiguous buffer bigger than 4 KB, but assuming we can allocate 4 KB > is fine. 4 KB is the lowest page size of any real architecture, and > if the kernel is out of free pages then any allocation is likely to > fail. Allocations of larger buffers may fail because of memory > fragmentation, even with plenty of free memory. > > That is: a 4 KB buffer is fine. Given this, I think that we'll need to go with the linked list then. Maybe something like: struct ib_mad_segment { struct list_head list; u8 data[0]; }; struct ib_mad_send_buf { ... void *mad; /* first segment */ struct list_head rmpp_list; u32 segment_size; ... }; I'm undecided about whether all MADs should use the rmpp_list, with *mad referencing the data of the first segment. This keeps the code consistent, but would result in the first segment being larger (256-bytes) than additional segments (say 220-bytes). Users could then walk the list of buffers without calling a routine that needs to start at the beginning of the list every time. - Sean From mshefty at ichips.intel.com Thu Feb 9 09:58:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 09:58:29 -0800 Subject: [openib-general] QP timeout In-Reply-To: References: Message-ID: <43EB82C5.4080607@ichips.intel.com> Or Gerlitz wrote: > Can you see if/what is there a way for a CMA consumer to set the QP > timeout? 
Reading the .h files and the ib_cm/cma code the best i managed > to find is the following setting in cm_init_qp_rts_attr > > qp_attr->timeout = cm_id_priv->local_ack_timeout; Local ACK timeout is set based on the packet life time set in the path record. Nothing prevents a CMA consumer from changing this value after route resolution completes, but doing so requires knowing the the resolved route is actually a path record. The path record is exposed to the consumer through the API. > is cm_id_priv->local_ack_timeout related to > > req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; These values are IB CM specific values, and are only used to determine how long to wait between retries of IB CM messages. - Sean From mshefty at ichips.intel.com Thu Feb 9 10:11:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 10:11:17 -0800 Subject: [openib-general] Re: 2/2 libibverbs + libmthca changes for query QP In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD12E@mtlexch01.mtl.com> Message-ID: <43EB85C5.5030803@ichips.intel.com> Roland Dreier wrote: > This reminds me of something I've been meaning to ask: what is the > attr_mask of the query QP function used for? The verbs definition in > the IB spec does not include an attribute mask. Obviously we don't > have to follow the spec precisely but I'm wondering what the advantage > is in having an attribute mask for query QP. None of the other query > methods (including query SRQ) have an attribute mask. My understanding of the original code that this came from is that it permitted querying attributes that may be stored in memory, rather than requiring reading the card. E.g. if all you needed was the QPN, then you could obtain only that information. Given that things like the QPN, QP type, etc. are already available through the API, I'm not sure that the mask is needed. 
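A minimal sketch of the attribute-mask idea described above, under stated assumptions: the names `fake_qp`, `query_qp` and the `QP_ATTR_*` bits are hypothetical and are not the libibverbs or kernel verbs API. It only illustrates how a mask could let a provider answer from host memory and skip the device round trip when nothing device-resident was requested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bits, for illustration only. */
#define QP_ATTR_QPN     0x1u  /* assumed to be cached in host memory */
#define QP_ATTR_QP_TYPE 0x2u  /* assumed to be cached in host memory */
#define QP_ATTR_STATE   0x4u  /* assumed to require reading the device */
#define QP_ATTR_TIMEOUT 0x8u  /* assumed to require reading the device */
#define QP_ATTR_CACHED  (QP_ATTR_QPN | QP_ATTR_QP_TYPE)

struct qp_attr {
    uint32_t qpn;
    int      qp_type;
    int      state;
    uint8_t  timeout;
};

struct fake_qp {
    struct qp_attr host_copy;    /* values the driver keeps on the host */
    struct qp_attr device_copy;  /* stands in for state held on the card */
    unsigned       device_reads; /* counts simulated round trips */
};

/* Fill *out according to attr_mask; touch the "device" only when needed. */
static int query_qp(struct fake_qp *qp, unsigned attr_mask, struct qp_attr *out)
{
    assert(qp != NULL && out != NULL);

    /* Cheap attributes are always supplied, requested or not. */
    out->qpn = qp->host_copy.qpn;
    out->qp_type = qp->host_copy.qp_type;

    /* Only a request for a non-cached attribute costs a round trip. */
    if (attr_mask & ~QP_ATTR_CACHED) {
        qp->device_reads++;
        out->state = qp->device_copy.state;
        out->timeout = qp->device_copy.timeout;
    }
    return 0;
}
```

In this sketch a caller asking only for `QP_ATTR_QPN` never increments `device_reads`, while adding `QP_ATTR_TIMEOUT` to the mask does; fields that were not fetched are left untouched in `*out`.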
- Sean From halr at voltaire.com Thu Feb 9 10:04:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 13:04:15 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osm_sa_path_record.c - variable declaration In-Reply-To: <5zlkwo24ek.fsf@mtl066.yok.mtl.com> References: <5zlkwo24ek.fsf@mtl066.yok.mtl.com> Message-ID: <1139508255.4450.1048.camel@hal.voltaire.com> On Mon, 2006-02-06 at 05:00, Yael Kalka wrote: > Hi Hal, > > There was an issue discussed a while ago regarding declaration of > several variables inside the function, in the code handling path > record for multicast. Declaration in the middle of the function > doesn't compile on windows, and in the past you said that the preffered > approach by you is to add parenthesis on the code handling the > multicast path records. This patch adds these parenthesis. Thanks. Applied. -- Hal From caitlinb at broadcom.com Thu Feb 9 10:23:49 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 10:23:49 -0800 Subject: [openib-general] Re: 2/2 libibverbs + libmthca changes for query QP Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DCED@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Roland Dreier wrote: >> This reminds me of something I've been meaning to ask: what is the >> attr_mask of the query QP function used for? The verbs definition in >> the IB spec does not include an attribute mask. Obviously we don't >> have to follow the spec precisely but I'm wondering what the >> advantage is in having an attribute mask for query QP. None of the >> other query methods (including query SRQ) have an attribute mask. > > My understanding of the original code that this came from is > that it permitted querying attributes that may be stored in > memory, rather than requiring reading the card. E.g. if all > you needed was the QPN, then you could obtain only that > information. Given that things like the QPN, QP type, etc. 
> are already available through the API, I'm not sure that the mask is > needed. > Anticipating what queries are expensive (require a round-trip to the card) is inherently model dependent. Having an attribute mask allows the consumer to state *exactly* what informatino is required, and allowing each driver to decide whether a round-trip to the card is needed for *this* set of information. It should be clear that the provider MAY supply answers that were not requested but SHOULD NOT take extra steps to obtain that information. That is, a provider could first copy all the data in had on-host, and then check a flag to see if a round-trip to the device was required. If so it might just copy everything from the device. This style was selected in both DAT and IT-API because it simpler to use this type of attribute uniformly rather than expecting the consumer to remember which queries might be issues on *some* devices. From iod00d at hp.com Thu Feb 9 10:25:37 2006 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Feb 2006 10:25:37 -0800 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> Message-ID: <20060209182537.GH19594@esmail.cup.hp.com> On Thu, Feb 09, 2006 at 07:32:31AM -0800, Roland Dreier wrote: > Michael> My problems if you recall with the mellanox cards seem to > Michael> be related to Linux Kernel 2.6.15.3... Installed RHEL4 > Michael> instead of Fedora Core and get the same results using > Michael> kernel 2.6.15.3 and RHEL4... > > No, I think the problem is that something about your system is causing > problems when the driver resets the HCA. Presumably the silverstorm > stack does not reset the HCA. > > I had one thought: do you know the speed (100 MHz, 133 MHz, ...) of > the PCI slot that your HCA is in? 
Or can you move the card to a different slot and try again? Sometimes cards will work in one slot but not another due to timing differences or signal quality differences between slots. grant > > In any case you should be able to comment out the call to > mthca_reset() in mthca_init_one() as a workaround for now. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Feb 9 10:21:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 13:21:30 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osm_ucast_mgr.c - use dynamic alloc In-Reply-To: <5zk6c8245i.fsf@mtl066.yok.mtl.com> References: <5zk6c8245i.fsf@mtl066.yok.mtl.com> Message-ID: <1139509289.4450.1083.camel@hal.voltaire.com> On Mon, 2006-02-06 at 05:05, Yael Kalka wrote: > Hi Hal, > > The original static allocation doesn't compile in Windows. > The attached patch replaces it with dynamic allocation. Thanks. Applied. -- Hal From roy.k.larsen at intel.com Thu Feb 9 10:33:23 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Thu, 9 Feb 2006 10:33:23 -0800 Subject: [dat-discussions] [openib-general][RFC]DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E086B8526@orsmsx408> >But why define an IB specific feature when a transport neutral feature >can be defined? > >Viewing the operation as Write with following Send maintains transport >neutral semantics AND allows IB to encode it as a Write with Immediate. > >That avoids IB to use the silicon that already exists to support >compressing >the Write and Send into a single message. That is the real benefit, isn't >it? No, it's not.... >And for both transports it enables the Provider to pass the 4 byte >immediate >data by value rather than by registered reference. 
So there is a definite >benefit >to IB, and a potential benefit to IP, and it works for both transports. > >The *only* thing gained by making it a transport specific method is >the implicit 33rd bit in the "that RDMA Write payload you asked for >has arrived" message. > Ok, finally. A realization that the semantics of write/send are not the same as IB write with immediate data. And the difference is important. The proposed emulation could not pass a black box test since nothing distinguishes an "immediate" receive message from a standard one containing rkeys or any other random data an application may need to exchange through send/receive. A true write with immediate data can pass such a black box test because it offers a unique service whereas the proposed emulation does not. It is a "helper" function that uses existing services. I have no objection to a write/send helper function, just call it that and not write with immediate data. Leave the true immediate data service as an extension as first proposed. >Is there a concrete example of any benefit from encoding a 33rd bit in >the selection of Write with Immediate versus Write followed by 32-bit Send? Yes, as stated several times, applications that use the send/receive facility to exchange information such as rkeys as well as using write immediate services must be able to unambiguously tell the difference between receive indications. Putting a requirement on the application to make that distinction by its own devices provides no additional service that it doesn't already have in existing APIs.
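The distinction argued for above can be sketched as follows. This is an illustration only: `rx_kind`, `rx_completion` and `on_receive` are hypothetical names, not DAT or verbs definitions. The assumption is that a true write-with-immediate service reports its own completion kind, so the receiver can tell the two indications apart even when they carry the same 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive-side completion kinds. */
enum rx_kind {
    RX_SEND,      /* ordinary send/receive message */
    RX_WRITE_IMM  /* RDMA write with immediate data has landed */
};

struct rx_completion {
    enum rx_kind kind;
    uint32_t     value;  /* 4-byte message payload or immediate data */
};

struct rx_stats {
    unsigned sends;
    unsigned write_imms;
};

/* Dispatch on the completion kind, never on the 32-bit value itself. */
static void on_receive(const struct rx_completion *c, struct rx_stats *s)
{
    assert(c != NULL && s != NULL);

    switch (c->kind) {
    case RX_WRITE_IMM:
        s->write_imms++;  /* the written payload is now in place */
        break;
    case RX_SEND:
        s->sends++;       /* e.g. an rkey exchange */
        break;
    }
}
```

With the write-plus-send emulation every indication would arrive as `RX_SEND`, so the switch above would have nothing to dispatch on and the application would have to encode the distinction in the payload.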
Roy From sean.hefty at intel.com Thu Feb 9 10:40:31 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Feb 2006 10:40:31 -0800 Subject: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139416042.26808.14.camel@stevo-desktop> Message-ID: >Attached is a user-mode program, called rping, that uses librdmacm and >libibverbs to implement a ping-pong program over an RC connection. The >program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq >channels to get cq events, and rdma_get_event() to detect CMA events. >It is multi-threaded. > >I've built it as an example program in librdmacm/examples and tested it >with mthca. It is useful to test CMA as well as all the major rdma >operations in a transport-neutral way. > >If you all find it has utility, please pull it into librdmacm/examples. What exactly should I see that indicates that the test works? I enabled debug output. The server side ends up at "rdma_bind_addr worked". The client side creates a QP, calls rping_setup_buffers, then disconnects. The parameters that I used are: rping -s -a 192.168.0.101 -p 7174 rping -c -a 192.168.0.101 -p 7174 Without debugging turned on, should I see output confirming that the test has executed correctly? - Sean From caitlinb at broadcom.com Thu Feb 9 11:10:06 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 11:10:06 -0800 Subject: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DD04@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >> Attached is a user-mode program, called rping, that uses librdmacm >> and libibverbs to implement a ping-pong program over an RC >> connection. The program utilizes SEND, RECV, RDMA READ, and WRITE >> ops, as well as cq channels to get cq events, and rdma_get_event() >> to detect CMA events. It is multi-threaded. 
>> >> I've built it as an example program in librdmacm/examples and tested >> it with mthca. It is useful to test CMA as well as all the major >> rdma operations in a transport-neutral way. >> >> If you all find it has utility, please pull it into >> librdmacm/examples. > > What exactly should I see that indicates that the test works? > I enabled debug output. The server side ends up at > "rdma_bind_addr worked". The client side creates a QP, calls > rping_setup_buffers, then disconnects. The parameters that I used > are: > > rping -s -a 192.168.0.101 -p 7174 > rping -c -a 192.168.0.101 -p 7174 > > Without debugging turned on, should I see output confirming > that the test has executed correctly? > That actually raises an excellent point relevant to the point of starting to accumulate a collection of test programs that might someday be called a suite. Having uniform expectations on what a test tool should output to stdout, stderr, what exit codes should be used and even some uniformity on switches would be very good. Is there an existing test program that should be considered as a template for others writing new test tools? From swise at opengridcomputing.com Thu Feb 9 11:11:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Feb 2006 13:11:07 -0600 Subject: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: References: Message-ID: <1139512267.27745.20.camel@stevo-desktop> Sounds like the connection didn't get setup... By default it should run continually ping/ponging until you hit ctrl-c. Without -d or -v, you won't see any output. If it exited, then something must have failed. -d should show more info. The -vV flags will display the ping/pong messages and validate the data. I assume you can ping between the two systems over ipoib interfaces? 
On Thu, 2006-02-09 at 10:40 -0800, Sean Hefty wrote: > >Attached is a user-mode program, called rping, that uses librdmacm and > >libibverbs to implement a ping-pong program over an RC connection. The > >program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq > >channels to get cq events, and rdma_get_event() to detect CMA events. > >It is multi-threaded. > > > >I've built it as an example program in librdmacm/examples and tested it > >with mthca. It is useful to test CMA as well as all the major rdma > >operations in a transport-neutral way. > > > >If you all find it has utility, please pull it into librdmacm/examples. > > What exactly should I see that indicates that the test works? I enabled debug > output. The server side ends up at "rdma_bind_addr worked". The client side > creates a QP, calls rping_setup_buffers, then disconnects. The parameters that > I used are: > > rping -s -a 192.168.0.101 -p 7174 > rping -c -a 192.168.0.101 -p 7174 > > Without debugging turned on, should I see output confirming that the test has > executed correctly? > > - Sean From halr at voltaire.com Thu Feb 9 11:00:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 14:00:54 -0500 Subject: [openib-general] Re: [PATCH] Opensm - add syslog prints in windows In-Reply-To: <5zfymw1wuk.fsf@mtl066.yok.mtl.com> References: <5zfymw1wuk.fsf@mtl066.yok.mtl.com> Message-ID: <1139511652.4450.1187.camel@hal.voltaire.com> On Mon, 2006-02-06 at 07:43, Yael Kalka wrote: > Hi Hal, > > Currently SYSLOG prints are not executed under Windows. > The following patch adds these printings to the Windows stack as well. Thanks. Applied. -- Hal From sean.hefty at intel.com Thu Feb 9 11:16:39 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Feb 2006 11:16:39 -0800 Subject: [openib-general] [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: <1139512267.27745.20.camel@stevo-desktop> Message-ID: >Sounds like the connection didn't get setup... 
That was my assumption as well. Okay - I'll start debugging this. Nothing in the code jumped out at me as incorrect. Have you tested this on IB, or only iWarp? >By default it should run continually ping/ponging until you hit ctrl-c. >Without -d or -v, you won't see any output. If it exited, then something >must have failed. -d should show more info. The -vV flags will display >the ping/pong messages and validate the data. > >I assume you can ping between the two systems over ipoib interfaces? Yes. - Sean From ardavis at ichips.intel.com Thu Feb 9 11:18:11 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 09 Feb 2006 11:18:11 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> <43EA8085.5070208@ichips.intel.com> Message-ID: <43EB9573.2030105@ichips.intel.com> Roland Dreier wrote: > >Hmm. Can you put a number on how much better RDMA write with >immediate is on current HCA hardware? How does using the underlying >OpenIB verbs ability to post a list of work requests compare (ie >posting an RDMA write followed by a send in one verbs call)? >Maybe "post multiple" is a better direction for DAT. > > With post multiple, unlike immediate data, you don't have the ability to distinguish between a normal receive and a rdma write completion indication on the other end. This is the uniqueness of the service that cannot be provided by the post multiple. Yes, post multiple would be a nice option for DAT it is just a different service. It would also be required to conform to the semantics rules of the bundled operations so you could not do any optimization tricks under the covers with an IB rdma_write_immediate operation. -arlin > - R. 
> > > From halr at voltaire.com Thu Feb 9 11:08:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 14:08:47 -0500 Subject: [openib-general] Re: [PATCH] Opensm - clean osm_vendor_mlx_sa.c code In-Reply-To: <5zhd7c1x1f.fsf@mtl066.yok.mtl.com> References: <5zhd7c1x1f.fsf@mtl066.yok.mtl.com> Message-ID: <1139512126.4450.1212.camel@hal.voltaire.com> Hi Yael, On Mon, 2006-02-06 at 07:39, Yael Kalka wrote: > Hi Hal, > > Currently in osm_vendor_mlx_sa.c the sent context is saved arbitrarily > as nodeInfo_context. This results in need for strange castings from > long to pointer and vice-versa. The following patch adds another > possible context - arbitrary context, which will be used in this case. Thanks. Applied with one question below. BTW, I have no way to test this (other than that things still work for OpenIB). Is this still one code base for gen1 too ? -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_mlx_sa.c > =================================================================== > --- libvendor/osm_vendor_mlx_sa.c (revision 5307) > +++ libvendor/osm_vendor_mlx_sa.c (working copy) > @@ -96,9 +96,9 @@ __osmv_sa_mad_rcv_cb( > goto Exit; > } > > - /* obtain the sent context since we store it during send in the ni_ctx */ > + /* obtain the sent context */ > p_query_req_copy = > - (osmv_query_req_t *)CAST_P2LONG(p_req_madw->context.ni_context.node_guid); > + (osmv_query_req_t *)(p_req_madw->context.arb_context.context1); > > /* provide the context of the original request in the result */ > query_res.query_context = p_query_req_copy->query_context; > @@ -207,7 +207,7 @@ __osmv_sa_mad_err_cb( > > /* Obtain the sent context etc */ > p_query_req_copy = > - (osmv_query_req_t *)CAST_P2LONG(p_madw->context.ni_context.node_guid); > + (osmv_query_req_t *)(p_madw->context.arb_context.context1); > > /* provide the context of the original request in the result */ > query_res.query_context = p_query_req_copy->query_context; > @@ 
-561,10 +561,17 @@ __osmv_send_sa_req( > /* > Provide the address to send to > */ > + /* Patch to handle IBAL - host order , where it should take destination lid in network order */ > +#ifdef OSM_VENDOR_INTF_AL > + p_madw->mad_addr.dest_lid = p_bind->sm_lid; > +#else > p_madw->mad_addr.dest_lid = cl_hton16(p_bind->sm_lid); > +#endif > p_madw->mad_addr.addr_type.smi.source_lid = > cl_hton16(p_bind->lid); > p_madw->mad_addr.addr_type.gsi.remote_qp = CL_HTON32(1); > + p_madw->mad_addr.addr_type.gsi.remote_qkey = IB_QP1_WELL_KNOWN_Q_KEY; > + p_madw->mad_addr.addr_type.gsi.pkey = IB_DEFAULT_PKEY; > p_madw->resp_expected = TRUE; > p_madw->fail_msg = CL_DISP_MSGID_NONE; > > @@ -574,12 +581,11 @@ __osmv_send_sa_req( > Since we can not rely on the client to keep it arroud until > the response - we duplicate it and will later dispose it (in CB). > To store on the MADW we cast it into what opensm has: > - p_madw->context.ni_context.node_guid > + p_madw->context.arb_context.context1 > */ > p_query_req_copy = cl_malloc(sizeof(*p_query_req_copy)); > *p_query_req_copy = *p_query_req; > - p_madw->context.ni_context.node_guid = > - (ib_net64_t)CAST_P2LONG(p_query_req_copy); > + p_madw->context.arb_context.context1 = p_query_req_copy; > > /* we can support async as well as sync calls */ > sync = ((p_query_req->flags & OSM_SA_FLAGS_SYNC) == OSM_SA_FLAGS_SYNC); > Index: include/opensm/osm_madw.h > =================================================================== > --- include/opensm/osm_madw.h (revision 5307) > +++ include/opensm/osm_madw.h (working copy) > @@ -315,6 +315,22 @@ typedef struct _osm_vla_context > boolean_t set_method; > } osm_vla_context_t; > /*********/ > +/****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t > +* NAME > +* osm_sa_context_t > +* > +* DESCRIPTION > +* Context needed by arbitrary recipient. 
> +* > +* SYNOPSIS > +*/ > +typedef struct _osm_arbitrary_context > +{ > + void* context1; > + void* context2; > +} osm_arbitrary_context_t; > +/*********/ > + > /****s* OpenSM: MAD Wrapper/osm_madw_context_t > * NAME > * osm_madw_context_t > @@ -335,6 +351,7 @@ typedef union _osm_madw_context > osm_smi_context_t smi_context; > osm_slvl_context_t slvl_context; > osm_pkey_context_t pkey_context; > + osm_arbitrary_context_t arb_context; Should this be carried for for all vendor layers or only the ones which need this ? > } osm_madw_context_t; > /*********/ > > @@ -880,6 +897,34 @@ osm_madw_get_vla_context_ptr( > } > /* > * PARAMETERS > +* p_madw > +* [in] Pointer to an osm_madw_t object. > +* > +* RETURN VALUES > +* Pointer to the start of the context structure. > +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + > +/****f* OpenSM: MAD Wrapper/osm_madw_get_arbitrary_context_ptr > +* NAME > +* osm_madw_get_arbitrary_context_ptr > +* > +* DESCRIPTION > +* Gets a pointer to the arbitrary context in this MAD. > +* > +* SYNOPSIS > +*/ > +static inline osm_arbitrary_context_t* > +osm_madw_get_arbitrary_context_ptr( > + IN const osm_madw_t* const p_madw ) > +{ > + return( (osm_arbitrary_context_t*)&p_madw->context ); > +} > +/* > +* PARAMETERS > * p_madw > * [in] Pointer to an osm_madw_t object. > * > From swise at opengridcomputing.com Thu Feb 9 11:19:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Feb 2006 13:19:16 -0600 Subject: [openib-general] [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA In-Reply-To: References: Message-ID: <1139512756.27745.26.camel@stevo-desktop> On Thu, 2006-02-09 at 11:16 -0800, Sean Hefty wrote: > >Sounds like the connection didn't get setup... > > That was my assumption as well. Okay - I'll start debugging this. Nothing in > the code jumped out at me as incorrect. Have you tested this on IB, or only > iWarp? > I tested it over IB only, using the latest trunk from yesterday and mthca, on 2.6.15.2. 
Since the amso driver doesn't have user-mode support, I couldn't test it with iWARP. However, it's based on krping, which is a kernel-mode ULP in the iwarp branch that works over both IB and the amso iwarp card. Holler if you have any questions. From caitlinb at broadcom.com Thu Feb 9 11:40:26 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 11:40:26 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DD0A@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Roland Dreier wrote: > >> >> Hmm. Can you put a number on how much better RDMA write with >> immediate is on current HCA hardware? How does using the underlying >> OpenIB verbs ability to post a list of work requests compare (ie >> posting an RDMA write followed by a send in one verbs call)? >> Maybe "post multiple" is a better direction for DAT. >> >> > With post multiple, unlike immediate data, you don't have the > ability to distinguish between a normal receive and a rdma > write completion indication on the other end. This is the > uniqueness of the service that cannot be provided by the post > multiple. Yes, post multiple would be a nice option for DAT > it is just a different service. It would also be required to > conform to the semantics rules of the bundled operations so > you could not do any optimization tricks under the covers > with an IB rdma_write_immediate operation. > A post_multiple also requires defining a single "DTO" data structure. If the post multiple is atomic (meaning all make it or none do) then it requires an intermediate data structure to have been created. If it is not atomic there really isn't a reason for it to not just be a utility function layered above DAT.

What I'm not seeing with the immediate is this urgent need by the application to be able to use the same 32-bit value for both an immediate and a 4-byte message that requires an entire additional API just to support it. Why can't the application just add a bool to the send message? Or encode the 32 bits so that they come from disjoint domains?

There seems to be agreement that a consolidated write-and-send call would enable the application to get the benefits of rdma write with immediate whenever the application could distinguish the two.

I cannot see why doing this is almost free for virtually all applications, and trivial for the remainder. Adding and documenting an extra call to deal with such an extreme corner case that is being presented only in the abstract is just not justified. This extra capability has to have enough functionality for enough applications to justify keeping it on the books, writing test cases for it, etc.

We already made a similar decision in having a 128-bit IA Address. That means we cannot support a host that interfaces to the Internet with IPv6 and an InfiniBand network that not only had global GIDs, but allocated a global subnetwork a network id that was already in use as a valid public IPv6 network.

The complexity of dealing with an IA Address that was 128+1 bits was simply not justified to deal with an extreme corner case that could very easily be avoided (there is no shortage of "site local" network IDs in the IPv6/GID format, so using a global network prefix that was disjoint from the official IPv6 hierarchy would be just plain silly).

So far I haven't seen any explanation as to why an application has a need to encode this 33rd bit of their message in this terribly transport-specific manner. Is there some severe performance penalty to slightly restructuring the send message so that it is no longer ambiguous with the immediate data?
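
The "disjoint domains" suggestion in the message above can be sketched in C. This is only an illustration and not part of DAT, uDAPL or any provider: the names TAG_KIND_WRITE_DONE, TAG_VALUE_MASK, tag_encode(), tag_is_write_done() and tag_value() are hypothetical, and using the top bit as the discriminator is an assumption made for the example. One bit of the 32-bit word is reserved so that the receiver can tell a write-completion notification from an ordinary 4-byte message, which leaves 31 bits for the application's value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, not defined by DAT or IB:
 * bit 31 marks an "RDMA write completed" notification,
 * bits 0..30 carry the application's own value. */
#define TAG_KIND_WRITE_DONE UINT32_C(0x80000000)
#define TAG_VALUE_MASK      UINT32_C(0x7FFFFFFF)

/* Build the 32-bit word carried in the 4-byte message (or immediate). */
static inline uint32_t tag_encode(bool write_done, uint32_t value)
{
    assert((value & ~TAG_VALUE_MASK) == 0); /* value must fit in 31 bits */
    return (write_done ? TAG_KIND_WRITE_DONE : 0) | value;
}

/* True if the word announces a completed RDMA write. */
static inline bool tag_is_write_done(uint32_t tag)
{
    return (tag & TAG_KIND_WRITE_DONE) != 0;
}

/* Recover the 31-bit application value from the word. */
static inline uint32_t tag_value(uint32_t tag)
{
    return tag & TAG_VALUE_MASK;
}
```

With this layout the same receive path can handle both kinds of message and branch on tag_is_write_done(). Both ends would still have to agree on byte order for the word, which the sketch does not address.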
From paul.baxter at dsl.pipex.com Thu Feb 9 12:17:36 2006 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Thu, 9 Feb 2006 20:17:36 -0000 Subject: [openib-general] NFS performance and general disk network export advice (Linux-Windows) Message-ID: <009e01c62db5$dece4b70$8000000a@blorp> I'm looking to export a filesystem from each of four linux 64bit boxes to a single Windows server 2003 64bit Ed. Has anyone achieved this already using an IB transport? Can I use NFS over IPoIB cross platform? i.e. do both ends support a solution? Is NFS over RDMA compatible with Windows (pretty sure the answer is no to this one but love to be proven wrong). I've attached Tom's announcement of the latest to the bottom of this email. I don't think Windows has the RDMA abstraction (yet)? Are windows IB drivers (Openib or Mellanox) compatible with these options? Do I layer Windows services for Unix on top of the Windows IB drivers and IPoIB to achieve a cross platform NFS? Has anyone done much in the way of NFS performance comparisons of NFS over IPoIB in cross-platform situations vs say Gigabit ethernet. Does it work :) What is large file throughput and processor loading - I'm aiming for 150-200 MB/s on large files on 4x SDR IB (possibly DDR if we can fit the bigger 144 port switch chassis into our rack layout for 50-ish nodes). Are there any alternatives to using NFS that may be better and that would 'transparently' receive a performance boost with IB compared with using a simple NFS/gigabit ethernet solution. Must be fairly straightforward, ideally application neutral (configure a drive and load/unload script for Linux and it just happens) and compatible between Win2003 and Linux? Alternatives using perhaps Samba on the Linux side? My lack of knowledge of IB in the windows world has got me concerned over whether this is actually achievable (easily). 
I hope to be trying this once we get a Windows 2003 machine, but hope someone can encourage me that its a breeze prior to my coming unstuck in a month or so! Some detail about the bit I do understand: I will be using a patched Linux kernel (realtime preemption patches ) but prefer not to apply/track too many kernel patches as the kernel evolves. The NFS patches suggested by Tom in his announcement below make me a little nervous. The application will alternate between a real-time mode with (probably) no NFS (or similar network exporting of the disk) and an archiving mode where Linux will load relevant network filesystem modules and let the windows machine read the disks. The reason for this odd load/unload behaviour is because our current experience with NFS has been that the driver is prone to putting multi-millisecond glitches that have a habit of upsetting (soft) real-time behaviour at the sorts of timing latencies we're looking at (milliseond or two). NFS (and network cards) do like to batch up work and then run these from interrupt contexts. SoftIRQs help tremendously but don't seem to be the complete answer. Paul Baxter Tom's announcement: > We have released an updated NFS/RDMA client for Linux at > the project's Sourceforge site: > > > > > > This release updates the RPC/RDMA support as follows: > Linux 2.6.15.2 supported > Integrates with RPC via 2.6.15 transport switch > Employs OpenIB RDMA verbs API (not kDAPL) > Dual BSD/GPL2 licensing > > There are no protocol changes in this release, it is identical to > the previous release (and the IETF draft) in this respect. The > client has been tested with NFSv3 and passes the Connectathon > test suite. 
> > At present, the client requires some additional transport switch > patches to be applied to the Linux kernel, these are available at > Chuck Lever's patches page: > > > The related CITI NFS/RDMA server project is currently available > for 2.6.14 from: > > > > > > This server is functional but only supports small RDMA inline data > transfers, and a single request in flight. So, its performance is quite > far from the potential. However, it is functional and is the server > we pass Connectathon with! > > The server project is now being developed by Open Grid Computing, > moving to the OpenIB common RDMA verbs API. We'll be making > updates to both client and server as they become available. There's > a lot more to do. > > We look forward to comments and feedback from the various standards > and open source communities on this. Feel free to use the mailing list > on the sourceforge project site, or any of these lists (which we usually > monitor) but cc at least me and James Lentini (jlentini at netapp.com). > > Thanks, > Tom Talpey, for the various NFS/RDMA projects. From krause at cup.hp.com Thu Feb 9 12:25:06 2006 From: krause at cup.hp.com (Michael Krause) Date: Thu, 09 Feb 2006 12:25:06 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: <43EA8085.5070208@ichips.intel.com> References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> <43EA8085.5070208@ichips.intel.com> Message-ID: <6.2.0.14.2.20060209122022.023baf20@esmail.cup.hp.com> At 03:36 PM 2/8/2006, Arlin Davis wrote: >Roland Dreier wrote: > >> Michael> So, here we have a long discussion on attempting to >> Michael> perpetuate a concept that is not universal across >> Michael> transports and was deemed to have minimal value that most >> Michael> wanted to see removed from the architecture. >> >>But this discussion is being driven by an application developer who >>does see value in immediate data. 
>> >>Arlin, can you quantify the benefit you see from RDMA write with >>immediate vs. RDMA write followed by a send? >> >> >We need speed and simplicity. > >A very latency sensitive application that requires immediate notification >of RDMA write completion on the remote node without ANY latency penalties >associated with combining operations, HCA priority rules across QPs, wire >congestion, etc. An application that has no requirement for messaging >outside of remote rdma write completion notifications. The application >would not have to register and manage additional message buffers on either >side, we can just size the queues accordingly and post zero byte messages. >We need something that would be equivelent to setting there polling on the >last byte of inbound data. But, since data ordering within an operation is >not guaranteed that is not an option. So, rdma with immediate data is the >most optimal and simplistic method for indication of RDMA-write completion >that we have available today. In fact, I would like to see it increased in >size to make it even more useful. RDMA Write with Immediate is part of the IB Extended Transport Header. It is a fixed-sized quantity and not one subject to change, i.e. increasing its size. Your argument above reinforces that the particular application need is IB-specific and thus should not be part of a general API but a transport-specific API. If the application will only operate optimally using immediate data, then it is only suitable for an IB fabric. This reinforces the need for a transport-specific API. Those applications that simply want to enable completion notification when a RDMA Write has occurred can use a general purpose API that is interconnect independent and whose code is predicated upon a RDMA Write - Send set of operations. This will enable application portability across all interconnect types. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From roy.k.larsen at intel.com Thu Feb 9 12:31:35 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Thu, 9 Feb 2006 12:31:35 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E086F55E9@orsmsx408> >>> Hmm. Can you put a number on how much better RDMA write with >>> immediate is on current HCA hardware? How does using the underlying >>> OpenIB verbs ability to post a list of work requests compare (ie >>> posting an RDMA write followed by a send in one verbs call)? >>> Maybe "post multiple" is a better direction for DAT. >>> >>> >> With post multiple, unlike immediate data, you don't have the >> ability to distinguish between a normal receive and a rdma >> write completion indication on the other end. This is the >> uniqueness of the service that cannot be provided by the post >> multiple. Yes, post multiple would be a nice option for DAT >> it is just a different service. It would also be required to >> conform to the semantics rules of the bundled operations so >> you could not do any optimization tricks under the covers >> with an IB rdma_write_immediate operation. >> > >A post_multiple also requires defining a single "DTO" data >structure. If the post multiple is atomic (meaning all make >it or none do) then it requires an intermediate data structure >to have been created. If it is not atomic there really isn't >reason for it to not just be a utility function layered >above DAT. That is very good point. And since the emulated immediate data service can't make the atomic guarantee it is the killer argument for just making the service plain - a potentially more efficient write/send. > >What I'm not seeing with the immediate is this urgent need >by the application to be able to use the same 32-bit value >for both an immediate and a 4 byte message that requires >an entire additional API just to support it. Why can't >the application just add a bool to the send message? 
>Or encode the 32-bits so that they come from disjoint >domains? Some applications can do as you suggest. Some applications can make good use of unambiguous indications where the buffer size, content, or arrival timing is not constrained. Some don't need write notification at all. What's your point? > >There seems to be agreement that a consolidated write-and-send >call would enable the application to get the benefits of >rdma write with immediate whenever the application could >distinguish the two. Well, I think there is agreement that *some* applications can use write-and-send in a beneficial way. But then again, nothing prevents them from doing that now. They do not need an additional API. But again, I don't have an issue with defining a helper function. I do have an issue with defining an API and semantic that says the target side needs to be coded in a way to always deal with both "true" immediate data and emulation. Just define a write/send helper API and the UPL can be coded in a consistent manner if that is a beneficial service. If a true unambiguous indication service is more beneficial or required, it can use the extension and accept the extra complexity. To demand extra complexity in applications that obviously don't need the true immediate data semantic is just wrong in my option. > >I cannot see why doing this is almost free for virtually >all applications, and trivial for the remainder. Adding >and documenting an extra call to deal with such an >extreme corner case that is being presented only in >the abstract is just not justified. This extra capability >has to have enough functionality for enough applications >to justify keeping it on the books, writing test cases >for it, etc. All we're asking is that a write/send combined API not be called immediate data unless it fits the semantics of immediate data. I am puzzled at the resistance this is getting. There is a standards body specification for immediate data. 
If it is not followed, don't call it immediate data. It's that simple. For those transports that can provide the service, the UPL may be able to gain access to it through an extension. Roy From mst at mellanox.co.il Thu Feb 9 12:36:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 22:36:00 +0200 Subject: [openib-general] Re: libsdp running nearly fine In-Reply-To: <1139502670.14833.14.camel@ipnnarval> References: <1139502670.14833.14.camel@ipnnarval> Message-ID: <20060209203600.GA1974@mellanox.co.il> Quoting r. Xavier Grave : > Subject: libsdp running nearly fine > > Hi all, > > I have setup libsdp and it works quite well except if I try to send > buffer with a size > 5100 bytes I get this kind of kernel messages : > Unable to handle kernel paging request for data at address > 0xd000080080085cc0 > Faulting instruction address: > 0xd0000000001dd3b4 > Oops: Kernel access of bad area, sig: 7 > [#1] > SMP NR_CPUS=32 NUMA PSERIES > LPAR > Modules linked in: ipv6 nfsd exportfs nfs_acl lockd sunrpc ib_uverbs > psmouse idv > NIP: D0000000001DD3B4 LR: C00000000022C808 CTR: > D0000000001DD25C > REGS: c0000000b27d7330 TRAP: 0300 Not tainted > (2.6.16-rc2) > MSR: 8000000000009032 CR: 24000488 XER: > 00000010 > DAR: D000080080085CC0, DSISR: > 0000000042000000 > TASK = c000000004a12040[2760] 'client' THREAD: c0000000b27d4000 CPU: > 3 > GPR00: 0000000000020000 C0000000B27D75B0 D0000000002014A8 > C0000000042A0B00 > GPR04: C0000000049B3BA0 0000000000000002 0000000000481000 > C0000000004CE588 > GPR08: 0000000000020033 0000000000020033 D000080080085CC0 > 00000000000000F0 > GPR12: 0000200000000000 C0000000003BC100 00000000100D0000 > 0000000000000000 > GPR16: 0000000000000000 0000000010197EA8 0000000000000001 > C0000000B27D7C98 > GPR20: 0000000000000000 C0000000B27D7B08 C0000000B1AD1E60 > C0000000049B3BA0 > GPR24: 0000000000000002 C00000000474FC80 C000000007323A20 > 8000000000009032 > GPR28: C00000000474FC98 C000000007323A00 C000000000407420 > 
C000000007323A10 > NIP [D0000000001DD3B4] .mthca_tavor_map_phys_fmr+0x158/0x190 > [ib_mthca] > LR [C00000000022C808] .ib_fmr_pool_map_phys > +0x2a4/0x4a8 > Call > Trace: > [C0000000B27D75B0] [C00000000022C5C4] .ib_fmr_pool_map_phys+0x60/0x4a8 > (unrelia) > [C0000000B27D7670] [C00000000024E7B0] .sdp_iocb_register > +0x5c/0x11c > [C0000000B27D7700] [C000000000253A8C] .sdp_send_data_queue_test > +0x624/0xd7c > [C0000000B27D7820] [C000000000254220] .sdp_send_data_queue > +0x3c/0xb0 > [C0000000B27D78C0] [C000000000255078] .sdp_inet_send > +0x5d8/0xc9c > [C0000000B27D7A10] [C00000000025BEEC] .sock_sendmsg > +0x114/0x15c > [C0000000B27D7C10] [C00000000025CACC] .sys_sendto > +0xd0/0x110 > [C0000000B27D7D90] [C00000000027BE9C] .compat_sys_socketcall > +0x148/0x214 > [C0000000B27D7E30] [C0000000000086F8] syscall_exit > +0x0/0x40 > Instruction > dump: > 81230020 e9430048 396000f0 90030040 60000000 60000000 60000000 > e8080636 > 7d290214 79280020 91030024 91030020 <996a0000> 7c0004ac 2f850000 > 78a90020 > > kernel is compiled with infiniband svn drivers, power5 based server. > I allocate the memory with memalign and getpagesize, I compiled the > ib_sdp with zero copy buffer on. > Do I need to remove these options ? > > xavier Yes, zcopy is currently Intel only - I dont use the DMA API properly. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 9 12:39:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 22:39:05 +0200 Subject: [openib-general] Re: 2/2 libibverbs + libmthca changes for query QP In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD12E@mtlexch01.mtl.com> Message-ID: <20060209203905.GB1974@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: 2/2 libibverbs + libmthca changes for query QP > > This reminds me of something I've been meaning to ask: what is the > attr_mask of the query QP function used for? The verbs definition in > the IB spec does not include an attribute mask. 
Obviously we don't > have to follow the spec precisely but I'm wondering what the advantage > is in having an attribute mask for query QP. None of the other query > methods (including query SRQ) have an attribute mask. > > - R. Mellanox driver does not need it. AFAIK the eHCA driver seems to use this mask. Can one of the eHCA authors explain why? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From Arkady.Kanevsky at netapp.com Thu Feb 9 12:38:57 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 9 Feb 2006 15:38:57 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: Why both Immediate Data and the Stag which was used for RDMA Write? Immediate data already contains info in response to what operation the RDMA Write has completed locally. Stag would make sence if Stag invalidation also put in the mix. But for MPI RMR_context have a long lifecycle so not clear which apps will be interested in combining Invalidation with RDMA Write with Immediate data. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, February 07, 2006 3:03 PM > To: Larsen, Roy K; dat-discussions at yahoogroups.com; Arlin > Davis; Hefty, Sean > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > openib-general-bounces at openib.org wrote: > > Caitlin Bestler wrote: > >> > >> Arlin Davis wrote: > >>> Sean Hefty wrote: > >>> > >>>>> The requirement is to provide an API that supports RDMA writes > >>>>> with immediate data. A send that follows an RDMA write is not > >>>>> immediate data, and the API should not be constructed around > >>>>> trying to make it so. 
> >>>>> > >>>>> > >>>> > >>>> To be clear, I believe that write with immediate should > be part of > >>>> the normal APIs, rather than an extension, but should be > designed > >>>> around those devices that provide it natively. > >>>> > >>>> > >>> I totally agree. A standard RDMA write with immediate API can be > >>> very useful to RDMA applications based on the requirements (native > >>> support) set forth in my earlier email. It is analogous to the new > >>> dat_ep_post_send_with_invalidate() call; a call that supports a > >>> native iWARP transport operation but provides no > provisions to help > >>> other transports emulate. So, other transports simply return > >>> NOT_SUPPORTED and add it natively in the future if it makes sense. > >>> > >>> -arlin > >> > >> What is proposed in a definition of > >> 'dat_ep_post_rdma_write_with_immediate' > >> that can be implemented over iWARP using the sequence of messages > >> that were intended to support the same purpose (i.e., letting the > >> other side know that an RDMA Write transfer has been fully > received). > > > > No, iWARP *CAN NOT* implement write immediate data any > better than IB > > can implement send with invalidate. Immediate data > > *MUST* be indicated to the ULP unambiguously. Imposing an > algorithm > > on the application to infer immediate data arrival is hack, > pure and > > simple. An application is free to perform a write/send if > that is the > > semantic they want. Why does iWARP get transport unique > APIs but not > > IB? I find this attempt to bastardize the IB semantic of immediate > > data a little curious. > > > > The transports aren't getting anything. Features are there > for applications, especially when the feature can be defined > in a way that makes sense without explaining transport mechanics. 
> > Completing a transaction, complete with supplying a > transaction response and releasing the advertised STag > associated with the transaction is something that makes sense > in the application domain and conforms to normal DAT ordering rules. > > "Provide information about an RDMA Write to a receive operation" > also meets that definition -- as long as it conforms to the > existing ordering rules. Shifting to an 8 byte message over > iWARP to allow for the write length *and* immediate 'tag' > is certainly doable. We could even consider having the DAT > Provider supply the 'buffer' silently in the DTO itself. > > With that definition the consumer would get a receive > completion that told them that their peer's RDMA Write had > been successfully placed, how long it is (the length) and > which one (a tag). > > I think that is of value. iWARP can implement it as two work > requests and maintain the overall semantics. > > Are you arguing that iWARP should NOT provide this service > until it can do it in a single work request? It seems to me > that allowing an extra work request and completion is a > fairly simple accomodation as opposed to using an alternate > algorithm in the main transaction processing of the application. > > If we enable the applicatin can query how a remote write with > immediate will complete outside of the transaction loop then > we can allow the application to have *no* overhead inside the > main transaction loop, and *identical* logic on the sending side. > > And IB *could* implement send with invalidate by simply > agreeing on how the RKey to be invalidated is communicated > between the IB providers (perhaps as an immediate). > > But more to the point, I don't see how the more flexible > definition of write with immediate negatively impacts the IB > implementation of the feature. IB providers do not need to > allow for the extra work requests. 
They are not being asked > to place the immediate data into the receive buffer, or to do > any extra work at all. > From Arkady.Kanevsky at netapp.com Thu Feb 9 12:47:20 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 9 Feb 2006 15:47:20 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: Caitlin, can you clarify this. Are you proposing that Consumer encode a bit of Immediate Data to specify that it is immediate data? iWARP will pass it in Send message and IB in Immediate Data. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Thursday, February 09, 2006 2:40 PM > To: Arlin Davis; Roland Dreier > Cc: dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > openib-general-bounces at openib.org wrote: > > Roland Dreier wrote: > > > >> > >> Hmm. Can you put a number on how much better RDMA write with > >> immediate is on current HCA hardware? How does using the > underlying > >> OpenIB verbs ability to post a list of work requests compare (ie > >> posting an RDMA write followed by a send in one verbs call)? > >> Maybe "post multiple" is a better direction for DAT. > >> > >> > > With post multiple, unlike immediate data, you don't have > the ability > > to distinguish between a normal receive and a rdma write completion > > indication on the other end.
This is the uniqueness of the service > > that cannot be provided by the post multiple. Yes, post > multiple would > > be a nice option for DAT it is just a different service. It > would also > > be required to conform to the semantics rules of the bundled > > operations so you could not do any optimization tricks under the > > covers with an IB rdma_write_immediate operation. > > > > A post_multiple also requires defining a single "DTO" data > structure. If the post multiple is atomic (meaning all make > it or none do) then it requires an intermediate data > structure to have been created. If it is not atomic there > really isn't reason for it to not just be a utility function > layered above DAT. > > What I'm not seeing with the immediate is this urgent need by > the application to be able to use the same 32-bit value for > both an immediate and a 4 byte message that requires an > entire additional API just to support it. Why can't the > application just add a bool to the send message? > Or encode the 32-bits so that they come from disjoint domains? > > There seems to be agreement that a consolidated > write-and-send call would enable the application to get the > benefits of rdma write with immediate whenever the > application could distinguish the two. > > I cannot see why doing this is almost free for virtually all > applications, and trivial for the remainder. Adding and > documenting an extra call to deal with such an extreme corner > case that is being presented only in the abstract is just not > justified. This extra capability has to have enough > functionality for enough applications to justify keeping it > on the books, writing test cases for it, etc. > > We already made a similar decision in having a 128-bit IA > Address. 
That means we cannot support a host that interfaces > to the Internet with IPv6 and an InfiniBand network that not > only had global GIDs, but allocated a global subnetwork a > network id that was already in use as a valid public IPv6 network. > > The complexity of dealing with an IA Address that was > 128+1 bits was simply not jusitified to deal with > an extreme corner case that could very easily be avoided > (there is no shortage of "site local" network IDs in the > IPv6/GID format, so using a global network prefix that was > disjoint from the official IPv6 hierarchy would be just plain silly). > > So far I haven't seen any explanation as to why an > application has a need to encode this 33rd bit of their > message in this terribly transport specific matter. Is there > some severe performance penalty to slightly restructuring the > send message so that it is no longer ambiguous with the > immeidate data? > From caitlinb at broadcom.com Thu Feb 9 12:50:40 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 12:50:40 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DD23@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: >>>> Hmm. Can you put a number on how much better RDMA write with >>>> immediate is on current HCA hardware? How does using the >>>> underlying OpenIB verbs ability to post a list of work requests >>>> compare (ie posting an RDMA write followed by a send in one verbs >>>> call)? Maybe "post multiple" is a better direction for DAT.
>>>> >>>> >>> With post multiple, unlike immediate data, you don't have the >>> ability to distinguish between a normal receive and a rdma write >>> completion indication on the other end. This is the uniqueness of >>> the service that cannot be provided by the post multiple. Yes, post >>> multiple would be a nice option for DAT it is just a different >>> service. It would also be required to conform to the semantics >>> rules of the bundled operations so you could not do any >>> optimization tricks under the covers with an IB >>> rdma_write_immediate operation. >>> >> >> A post_multiple also requires defining a single "DTO" data structure. >> If the post multiple is atomic (meaning all make it or none do) then >> it requires an intermediate data structure to have been created. If >> it is not atomic there really isn't reason for it to not just be a >> utility function layered above DAT. > > That is very good point. And since the emulated immediate > data service can't make the atomic guarantee it is the killer > argument for just making the service plain - a potentially more > efficient write/send. > >> >> What I'm not seeing with the immediate is this urgent need by the >> application to be able to use the same 32-bit value for both an >> immediate and a 4 byte message that requires an entire additional API >> just to support it. Why can't the application just add a bool to >> the send message? Or encode the 32-bits so that they come from >> disjoint domains? > > Some applications can do as you suggest. Some applications > can make good use of unambiguous indications where the buffer > size, content, or arrival timing is not constrained. Some > don't need write notification at all. What's your point? > >> >> There seems to be agreement that a consolidated write-and-send call >> would enable the application to get the benefits of rdma write with >> immediate whenever the application could distinguish the two. 
> > Well, I think there is agreement that *some* applications can > use write-and-send in a beneficial way. But then again, > nothing prevents them from doing that now. They do not need > an additional API. But again, I don't have an issue with > defining a helper function. I do have an issue with defining > an API and semantic that says the target side needs to be > coded in a way to always deal with both "true" immediate data > and emulation. Just define a write/send helper API and the > UPL can be coded in a consistent manner if that is a > beneficial service. If a true unambiguous indication service > is more beneficial or required, it can use the extension and > accept the extra complexity. To demand extra complexity in > applications that obviously don't need the true immediate > data semantic is just wrong in my option. > >> >> I cannot see why doing this is almost free for virtually all >> applications, and trivial for the remainder. Adding and documenting >> an extra call to deal with such an extreme corner case that is being >> presented only in the abstract is just not justified. This extra >> capability has to have enough functionality for enough applications >> to justify keeping it on the books, writing test cases for it, etc. > > All we're asking is that a write/send combined API not be > called immediate data unless it fits the semantics of > immediate data. I am puzzled at the resistance this is > getting. There is a standards body specification for > immediate data. If it is not followed, don't call it > immediate data. It's that simple. For those transports that > can provide the service, the UPL may be able to gain access to it > through an extension. > I have no objection to calling this "dat_ep_post_rdma_write_with_notifier" and labelling the 32-bit data as a "notifier tag". Even on iWARP transports small send data can be in-lined, avoiding the need for buffers to be registered. 
A special API where the length of the "send buffer" is known in advance makes this even easier.

What I still fail to see is a rationale that works down from the application layer on why an application would need still one more page in their cookbook. Creating an entire new method to enable a strange method of signalling one bit of information to the other end doesn't seem like much of a payoff to me.

From caitlinb at broadcom.com Thu Feb 9 12:54:49 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 9 Feb 2006 12:54:49 -0800
Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DD26@NT-SJCA-0751.brcm.ad.broadcom.com>

dat-discussions at yahoogroups.com wrote:
> Caitlin,
> can you clarify this.
> Are you proposing that Consumer encode a bit of Immediate
> Data to specify that it is immediate data?
> iWARP will pass it in Send message and IB in Immediate Data.
>

If we agreed that there was some acute need for this 33rd bit coming down from the application layer then creating an iWARP untagged message that encoded the first 32 bits, the length of the RDMA write and the magic bonus bit would indeed be a possible solution.

I am skeptical that there is a true application derived need for this bonus bit that justifies the complexity required to document it.

If the application only needs this bonus bit when running over IB then it really doesn't need it at all.

From Arkady.Kanevsky at netapp.com Thu Feb 9 12:56:00 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 9 Feb 2006 15:56:00 -0500
Subject: [dat-discussions] [openib-general][RFC] DAT2.0immediatedataproposal
Message-ID:

Mike,
but then the combined operation can as easily be handled by a "multiple post operation". What is the need for a specific transport-independent RDMA Write with immediate data?

I am still concerned over the need of Consumer Recv side to separate recv of Immediate Data from "regular" Recv.
Consumer "knows" what it expects to match the posted Recv. There is one to one mapping between non-pure RDMA transfer ops of one side with Recv of another. Sure ULP may use the same size buffers for all. But how many ULPs mix the Immediate Data size messages ( 4 bytes on IB ) with normal Sends of the same exact size?

Arkady

Arkady Kanevsky email: arkady at netapp.com
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451 central phone: 781-768-5300

________________________________
From: Michael Krause [mailto:krause at cup.hp.com]
Sent: Thursday, February 09, 2006 3:25 PM
To: Arlin Davis
Cc: dat-discussions at yahoogroups.com; openib-general at openib.org
Subject: Re: [dat-discussions] [openib-general][RFC] DAT2.0immediatedataproposal

At 03:36 PM 2/8/2006, Arlin Davis wrote:

Roland Dreier wrote:

Michael> So, here we have a long discussion on attempting to
Michael> perpetuate a concept that is not universal across
Michael> transports and was deemed to have minimal value that most
Michael> wanted to see removed from the architecture.

But this discussion is being driven by an application developer who does see value in immediate data. Arlin, can you quantify the benefit you see from RDMA write with immediate vs. RDMA write followed by a send?

We need speed and simplicity.

A very latency sensitive application that requires immediate notification of RDMA write completion on the remote node without ANY latency penalties associated with combining operations, HCA priority rules across QPs, wire congestion, etc. An application that has no requirement for messaging outside of remote rdma write completion notifications. The application would not have to register and manage additional message buffers on either side, we can just size the queues accordingly and post zero byte messages. We need something that would be equivalent to sitting there polling on the last byte of inbound data.
But, since data ordering within an operation is not guaranteed that is not an option. So, rdma with immediate data is the most optimal and simplistic method for indication of RDMA-write completion that we have available today. In fact, I would like to see it increased in size to make it even more useful.

RDMA Write with Immediate is part of the IB Extended Transport Header. It is a fixed-sized quantity and not one subject to change, i.e. increasing its size. Your argument above reinforces that the particular application need is IB-specific and thus should not be part of a general API but a transport-specific API. If the application will only operate optimally using immediate data, then it is only suitable for an IB fabric. This reinforces the need for a transport-specific API. Those applications that simply want to enable completion notification when a RDMA Write has occurred can use a general purpose API that is interconnect independent and whose code is predicated upon a RDMA Write - Send set of operations. This will enable application portability across all interconnect types.

Mike

From Arkady.Kanevsky at netapp.com Thu Feb 9 13:02:55 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 9 Feb 2006 16:02:55 -0500
Subject: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal
Message-ID:

Roy,
what if tomorrow iWARP decides to support Immediate data with variable length? The API does not change, the semantic does not change, and IB will not be able to support it.

I am trying to define the semantic and API which will not have to be modified for each rev of the transport.

Arkady Kanevsky email: arkady at netapp.com
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] > Sent: Thursday, February 09, 2006 3:32 PM > To: dat-discussions at yahoogroups.com; Arlin Davis; Roland Dreier > Cc: openib-general at openib.org > Subject: RE: [dat-discussions] [openib-general] > [RFC]DAT2.0immediatedataproposal > > >>> Hmm. Can you put a number on how much better RDMA write with > >>> immediate is on current HCA hardware? How does using the > underlying > >>> OpenIB verbs ability to post a list of work requests compare (ie > >>> posting an RDMA write followed by a send in one verbs call)? > >>> Maybe "post multiple" is a better direction for DAT. > >>> > >>> > >> With post multiple, unlike immediate data, you don't have > the ability > >> to distinguish between a normal receive and a rdma write > completion > >> indication on the other end. This is the uniqueness of the service > >> that cannot be provided by the post multiple. Yes, post multiple > >> would be a nice option for DAT it is just a different service. It > >> would also be required to conform to the semantics rules of the > >> bundled operations so you could not do any optimization > tricks under > >> the covers with an IB rdma_write_immediate operation. > >> > > > >A post_multiple also requires defining a single "DTO" data > structure. > >If the post multiple is atomic (meaning all make it or none > do) then it > >requires an intermediate data structure to have been > created. If it is > >not atomic there really isn't reason for it to not just be a utility > >function layered above DAT. > > That is very good point. And since the emulated immediate > data service can't make the atomic guarantee it is the killer > argument for just making the service plain - a potentially > more efficient write/send. 
> > > > >What I'm not seeing with the immediate is this urgent need by the > >application to be able to use the same 32-bit value for both an > >immediate and a 4 byte message that requires an entire > additional API > >just to support it. Why can't the application just add a > bool to the > >send message? > >Or encode the 32-bits so that they come from disjoint domains? > > Some applications can do as you suggest. Some applications > can make good use of unambiguous indications where the buffer > size, content, or arrival timing is not constrained. Some > don't need write notification at all. What's your point? > > > > >There seems to be agreement that a consolidated write-and-send call > >would enable the application to get the benefits of rdma write with > >immediate whenever the application could distinguish the two. > > Well, I think there is agreement that *some* applications can > use write-and-send in a beneficial way. But then again, > nothing prevents them from doing that now. They do not need > an additional API. But again, I don't have an issue with > defining a helper function. I do have an issue with defining > an API and semantic that says the target side needs to be > coded in a way to always deal with both "true" immediate data > and emulation. Just define a write/send helper API and the > UPL can be coded in a consistent manner if that is a > beneficial service. If a true unambiguous indication service > is more beneficial or required, it can use the extension and > accept the extra complexity. To demand extra complexity in > applications that obviously don't need the true immediate > data semantic is just wrong in my option. > > > > >I cannot see why doing this is almost free for virtually all > >applications, and trivial for the remainder. Adding and > documenting an > >extra call to deal with such an extreme corner case that is being > >presented only in the abstract is just not justified. 
This extra > >capability has to have enough functionality for enough > applications to > >justify keeping it on the books, writing test cases for it, etc. > > All we're asking is that a write/send combined API not be > called immediate data unless it fits the semantics of > immediate data. I am puzzled at the resistance this is > getting. There is a standards body specification for > immediate data. If it is not followed, don't call it > immediate data. It's that simple. For those transports that > can provide the service, the UPL may be able to gain access > to it through an extension. > > Roy > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Thu Feb 9 13:08:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Feb 2006 23:08:13 +0200 Subject: [openib-general] test, please ignore Message-ID: <20060209210813.GB4277@mellanox.co.il> test, please ignore -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From Thomas.Talpey at netapp.com Thu Feb 9 13:14:56 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 09 Feb 2006 16:14:56 -0500 Subject: [openib-general] NFS performance and general disk network export advice (Linux-Windows) In-Reply-To: <009e01c62db5$dece4b70$8000000a@blorp> References: <009e01c62db5$dece4b70$8000000a@blorp> Message-ID: <7.0.1.0.2.20060209155610.040e4468@netapp.com> At 03:17 PM 2/9/2006, Paul Baxter wrote: >I'm looking to export a filesystem from each of four linux 64bit boxes to a >single Windows server 2003 64bit Ed. > >Has anyone achieved this already using an IB transport? Can I use NFS over >IPoIB cross platform? i.e. do both ends support a solution? > >Is NFS over RDMA compatible with Windows (pretty sure the answer is no to >this one but love to be proven wrong). 
>I've attached Tom's announcement of the latest to the bottom of this email.
>I don't think Windows has the RDMA abstraction (yet)?

Not the code I posted! :-) But sure, it's possible to implement NFS/RDMA on Windows. Let us know when you're ready to test. ;-)

>Are windows IB drivers (Openib or Mellanox) compatible with these options?
>Do I layer Windows services for Unix on top of the Windows IB drivers and
>IPoIB to achieve a cross platform NFS?

You could do this but your real challenge is the upper layer IFS interface. You would need to implement a Windows filesystem for NFS first. Of course, there are such beasts, Hummingbird's comes to mind.

The code I posted uses strictly the OpenIB RDMA interfaces, plus CMA for address resolution and making connections. By the way, it will work over iWARP too.

>Has anyone done much in the way of NFS performance comparisons of NFS over
>IPoIB in cross-platform situations vs say Gigabit ethernet. Does it work :)
>What is large file throughput and processor loading - I'm aiming for 150-200
>MB/s on large files on 4x SDR IB (possibly DDR if we can fit the bigger 144
>port switch chassis into our rack layout for 50-ish nodes).

NFS over IPoIB does work, but is nowhere near as low-overhead as native NFS over RDMA. There are several issues with an IPoIB implementation, first of all the fact that an IPoIB solution is quite a bit less optimal than a native 10GbE NIC:

- The UD connection typically has a single message in flight, which
  negates much of the streaming throughput achievable with RC.
- The IPoIB layer is an emulation, and does not generally perform the
  hardware checksumming and large segment offload that even 100Mb
  NICs provide.
- The network stack is still in the loop on both ends, adding
  computational overhead and latency.
- The data must still be copied.

I have seen native zero-copy zero-touch NFS/RDMA streaming at full PCI/X throughput using only about 20% of a dual-processor 2GHz Xeon.
Typically, most network stacks top out at 100% CPU at perhaps half this rate on similar platforms. I'd expect IPoIB to be even less due to the reasons above. >Are there any alternatives to using NFS that may be better and that would >'transparently' receive a performance boost with IB compared with using a >simple NFS/gigabit ethernet solution. Must be fairly straightforward, >ideally application neutral (configure a drive and load/unload script for >Linux and it just happens) and compatible between Win2003 and Linux? >Alternatives using perhaps Samba on the Linux side? > >My lack of knowledge of IB in the windows world has got me concerned over >whether this is actually achievable (easily). > >I hope to be trying this once we get a Windows 2003 machine, but hope >someone can encourage me that its a breeze prior to my coming unstuck in a >month or so! > >Some detail about the bit I do understand: > >I will be using a patched Linux kernel (realtime preemption patches ) but >prefer not to apply/track too many kernel patches as the kernel evolves. The >NFS patches suggested by Tom in his announcement below make me a little >nervous. The most important patches for integrating the NFS/RDMA client are already in the 2.6.15 kernel, but there is additional work which is still in progress. These are the patches I refer to. One of the major ones is the ability to dynamically load RPC transports, such as the NFS/RDMA module. So you do need some sort of patch to use the client, currently. The transport switch continues to evolve and become integrated into the kernel, so the need for this particular patch will fall away eventually. FYI, the transport switch is much more general than NFS/RDMA - it's the underpinning of IPv6 support for the NFS client. Your real issue in working with NFS/RDMA in the way you describe is the availability of the server. The Linux NFS/RDMA server is still very much under development, and will take time just to be ready for experimentation. 
Especially, it will take time to get it to a state where it can perform the way you require (performance). Please feel free to contact me offline if you want to talk about details of actually setting this up. With a stock 2.6.15.2 kernel and a couple of IB cards you could get it going just to get started. Tom. > >The application will alternate between a real-time mode with (probably) no >NFS (or similar network exporting of the disk) and an archiving mode where >Linux will load relevant network filesystem modules and let the windows >machine read the disks. > >The reason for this odd load/unload behaviour is because our current >experience with NFS has been that the driver is prone to putting >multi-millisecond glitches that have a habit of upsetting (soft) real-time >behaviour at the sorts of timing latencies we're looking at (milliseond or >two). NFS (and network cards) do like to batch up work and then run these >from interrupt contexts. SoftIRQs help tremendously but don't seem to be the >complete answer. > >Paul Baxter > >Tom's announcement: >> We have released an updated NFS/RDMA client for Linux at >> the project's Sourceforge site: >> >> >> >> > >> >> This release updates the RPC/RDMA support as follows: >> Linux 2.6.15.2 supported >> Integrates with RPC via 2.6.15 transport switch >> Employs OpenIB RDMA verbs API (not kDAPL) >> Dual BSD/GPL2 licensing >> >> There are no protocol changes in this release, it is identical to >> the previous release (and the IETF draft) in this respect. The >> client has been tested with NFSv3 and passes the Connectathon >> test suite. 
>> >> At present, the client requires some additional transport switch >> patches to be applied to the Linux kernel, these are available at >> Chuck Lever's patches page: >> >> >> The related CITI NFS/RDMA server project is currently available >> for 2.6.14 from: >> >> >> >> >MA_stage2_2005-12-19.patch> >> >> This server is functional but only supports small RDMA inline data >> transfers, and a single request in flight. So, its performance is quite >> far from the potential. However, it is functional and is the server >> we pass Connectathon with! >> >> The server project is now being developed by Open Grid Computing, >> moving to the OpenIB common RDMA verbs API. We'll be making >> updates to both client and server as they become available. There's >> a lot more to do. >> >> We look forward to comments and feedback from the various standards >> and open source communities on this. Feel free to use the mailing list >> on the sourceforge project site, or any of these lists (which we usually >> monitor) but cc at least me and James Lentini (jlentini at netapp.com). >> >> Thanks, >> Tom Talpey, for the various NFS/RDMA projects. > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From roy.k.larsen at intel.com Thu Feb 9 13:32:24 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Thu, 9 Feb 2006 13:32:24 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E086F57A0@orsmsx408> >> All we're asking is that a write/send combined API not be >> called immediate data unless it fits the semantics of >> immediate data. I am puzzled at the resistance this is >> getting. There is a standards body specification for >> immediate data. If it is not followed, don't call it >> immediate data. It's that simple. 
>> For those transports that can provide the service, the UPL may be able
>> to gain access to it through an extension.
>>
>
>I have no objection to calling this
>"dat_ep_post_rdma_write_with_notifier"
>and labelling the 32-bit data as a "notifier tag".

If this MUST be implemented by the provider as a (possibly optimized) write followed by a send, that sounds good to me. All transports can support it and provide the same semantic. No need for application schism. However, I wouldn't place a restriction on the size of the notifier tag. Somewhere along the line, the send data has to reside in a registered buffer. Might as well have the ULP supply it and let it define the contents and size.

>
>Even on iWARP transports small send data can be in-lined,
>avoiding the need for buffers to be registered. A special
>API where the length of the "send buffer" is known in
>advance makes this even easier.

Ah, I wasn't aware iWARP could carry inline data. I take it that's not possible on an iWARP RDMA write PDU however.

>
>What I still fail to see is a rationale that works down
>from the application layer on why an application would
>need still one more page in their cookbook. Creating an
>entire new method to enable a strange method of signalling
>one bit of information to the other end doesn't seem like
>much of a payoff to me.

Of course the semantics are much more than signaling one bit. Nevertheless, if the contention is that applications don't need that bit, that all they need are write/send semantics, then by all means, simply define an API that gives them that and this thread is closed. Provider writers for transports that can supply a true immediate data service would be free to waste their time supplying an unused service through an extension. But that business decision should be left to the provider writer, not this mailing list.
Roy From caitlinb at broadcom.com Thu Feb 9 13:47:53 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 9 Feb 2006 13:47:53 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: <54AD0F12E08D1541B826BE97C98F99F122DD3F@NT-SJCA-0751.brcm.ad.broadcom.com> Larsen, Roy K wrote: >> >> Even on iWARP transports small send data can be in-lined, avoiding >> the need for buffers to be registered. A special API where the >> length of the "send buffer" is known in advance makes this even >> easier. > > Ah, I wasn't aware iWARP could carry inline data. I take it > that's not possible on an iWARP RDMA write PDU however. > On the wire the data is always "in-line", only an RDMA Read Request references data that is not part of the message. The iWARP protocols do not specify much about the local interface. That role has been taken by the RDMAC verbs and RNIC-PI so far. The standard functionality defined in the RDMAC verbs do not mandate support for Inline Send work request. Neither do the IBTA verbs. The option shows up in APIs, and in firmware, because it is a valuable optimization that improves latency in the Device/host exchange independent of the wire protocol. In the vast majority of cases, the user verbs can implement inline sends very easily whenever the data is shorter than the SGL would have been. So in the sense that you can view the SQ itself as a "registered buffer" then it is true. But there is no need for a *separate* registered buffer. From roy.k.larsen at intel.com Thu Feb 9 13:57:33 2006 From: roy.k.larsen at intel.com (Larsen, Roy K) Date: Thu, 9 Feb 2006 13:57:33 -0800 Subject: [dat-discussions] [openib-general] [RFC]DAT2.0immediatedataproposal Message-ID: <468F3FDA28AA87429AD807992E22D07E086F5878@orsmsx408> >Roy, >and if tomorrow iWARP decides to support Immediate data with variable >length. API does not changes. Semantic does not changes and IB >will not be able to support it. 
> >I am trying to define the semantic and API which will not have to be >modified for each rev of the transport. Arkady, Simply define the API as all the parameters needed to do an RDMA write followed by a send. This is semantically all that many seem to believe is required. I would not restrict the size or contents of the send buffer supplied by the ULP. Could even be a zero length buffer just to trigger the receive completion. Don't try to make the operation any more magical than that. All transports can implement it consistently and the ULP can handle it consistently too. I can't see how anyone could object to that API since it is providing the service desired consistently among all transports. That said, I am not conceding that this service is the equivalent to IB RDMA write with immediate data and want to see a general extension API added for this and any future transport service that won't be supported by the DAPL API. Roy > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance Inc. phone: 781-768-5395 >1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 >Waltham, MA 02451 central phone: 781-768-5300 > From ftillier at silverstorm.com Thu Feb 9 14:02:23 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 9 Feb 2006 14:02:23 -0800 Subject: [openib-general] RE: [Openib-windows] NFS performance and general disk network export advice (Linux-Windows) In-Reply-To: <009e01c62db5$dece4b70$8000000a@blorp> Message-ID: <000b01c62dc4$8781aba0$6701a8c0@infiniconsys.com> Hi Paul, > I'm looking to export a filesystem from each of four linux > 64bit boxes to a single Windows server 2003 64bit Ed. > > Has anyone achieved this already using an IB transport? Can > I use NFS over IPoIB cross platform? i.e. do both ends > support a solution? IPoIB will interoperate cross platform, so any higher-level services you layer above TCP/IP or UDP/IP should work fine. 
> Is NFS over RDMA compatible with Windows (pretty sure the > answer is no to this one but love to be proven wrong). I've > attached Tom's announcement of the latest to the bottom of > this email. I don't think Windows has the RDMA abstraction > (yet)? There is no NFS over RDMA file system for OpenIB Windows. It would be great to have it, but the focus is currently on getting the core stack stable and released. The long term goals, at least from my perspective, is to match functionality between OpenIB Linux and Windows, even if the APIs aren't identical. The reality is that the iWARP crowd hasn't really been involved in the Windows project, and have not driven any requirements, so that stack is continuing to be focused on IB only. I don't have a timeline for getting functionality matched up, and we could certainly use more hands on deck for the Windows project. > Are windows IB drivers (Openib or Mellanox) compatible with > these options? > Do I layer Windows services for Unix on top of the Windows IB > drivers and IPoIB to achieve a cross platform NFS? I don't know what you would need to do to get NFS working on Windows, but that should be an orthogonal problem to getting IB working. If NFS works on Windows over GbE, it should work without a problem over IPoIB. > Has anyone done much in the way of NFS performance > comparisons of NFS over IPoIB in cross-platform situations > vs say Gigabit ethernet. Does it work :) What is large file > throughput and processor loading - I'm aiming for 150-200 > MB/s on large files on 4x SDR IB (possibly DDR if we can > fit the bigger 144 port switch chassis into our rack layout > for 50-ish nodes). I can tell you that IPoIB performance on Windows is pretty awful. The reason for that is that the IPoIB driver shoehorns itself into the NDIS stack as a 802.3 Ethernet NIC, and thus gets 6-byte Ethernet MAC addresses. 
Further, Windows doesn't have any IB knowledge, so the IPoIB driver is responsible for all ARP and DHCP encapsulation to match the IPoIB protocol on the wire. This involves snooping both outbound and inbound packets to see if they need conversion, which does nasty stuff to performance. Depending on the host CPU, 150-200MB/s should be achievable (I've seen 150+MB/s in some of my testing). > Are there any alternatives to using NFS that may be better > and that would 'transparently' receive a performance boost > with IB compared with using a simple NFS/gigabit ethernet > solution. Must be fairly straightforward, ideally application > neutral (configure a drive and load/unload script for Linux > and it just happens) and compatible between Win2003 and > Linux? Alternatives using perhaps Samba on the Linux side? If you only have a single Windows box that has to read data from one or more Linux boxes, you might have some success with making the Linux boxes SRP targets, and then using the Windows SRP driver to access the Linux boxes. The SRP target driver would have to handle SRP commands and perform local disk access. Of course, the file system would have to be Windows compatible with this solution, but you should be able to get the full RDMA performance since there would be no network stack involved. You'd also need to make sure that only a single system accesses the data on the disks exported as SRP targets to prevent corruption as those disks would appear as locally attached drives to the Windows box. I am unaware of an SRP target implementation for Linux, though, so that may not be a viable option for you. > My lack of knowledge of IB in the windows world has got me > concerned over whether this is actually achievable (easily). > > I hope to be trying this once we get a Windows 2003 machine, > but hope someone can encourage me that its a breeze prior to > my coming unstuck in a month or so! The IB stuff should be a breeze to get functional and interoperating. 
Whether performance matches your requirements/expectations is another thing. Do report back if you have any questions or run into any problems along the way. - Fab From mdidomenico at gmail.com Thu Feb 9 14:01:36 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Thu, 9 Feb 2006 17:01:36 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: <20060209182537.GH19594@esmail.cup.hp.com> References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> <20060209182537.GH19594@esmail.cup.hp.com> Message-ID: <97a7c7ed0602091401r2b55ba96r999f3043c46b601c@mail.gmail.com> On 2/9/06, Grant Grundler wrote: > On Thu, Feb 09, 2006 at 07:32:31AM -0800, Roland Dreier wrote: > > Michael> My problems if you recall with the mellanox cards seem to > > Michael> be related to Linux Kernel 2.6.15.3... Installed RHEL4 > > Michael> instead of Fedora Core and get the same results using > > Michael> kernel 2.6.15.3 and RHEL4... > > > > No, I think the problem is that something about your system is causing > > problems when the driver resets the HCA. Presumably the silverstorm > > stack does not reset the HCA. > > > > I had one thought: do you know the speed (100 MHz, 133 MHz, ...) of > > the PCI slot that your HCA is in? > > Or can you move the card to a different slot and try again? > Sometimes cards will work in one slot but not another due > to timing differences or signal quality differences between slots. > > > > In any case you should be able to comment out the call to > > mthca_reset() in mthca_init_one() as a workaround for now. Commenting out the mthca_reset seems to have fixed the problem... My apologies... So i'm now running RHEL4 w/ Kernel 2.6.15.3 with the openib stack... 
here's some outputs if you guys want it, from lspci and dmesg

I'm still curious why the reset caused an issue and whether it is thought that this is a random occurrence because of the hardware i have?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg.out
Type: application/octet-stream
Size: 15421 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lspci.out
Type: application/octet-stream
Size: 15442 bytes
Desc: not available
URL:

From rdreier at cisco.com Thu Feb 9 14:10:04 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 09 Feb 2006 14:10:04 -0800
Subject: [openib-general] Re: openib and mellanox hca problem
In-Reply-To: <97a7c7ed0602091401r2b55ba96r999f3043c46b601c@mail.gmail.com> (Michael Di Domenico's message of "Thu, 9 Feb 2006 17:01:36 -0500")
References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> <20060209182537.GH19594@esmail.cup.hp.com> <97a7c7ed0602091401r2b55ba96r999f3043c46b601c@mail.gmail.com>
Message-ID:

Michael> I'm still curious why the reset caused an issue and
Michael> whether it is thought that this is a random occurrence
Michael> because of the hardware i have?

It is definitely something to do with the particular setup you have. Hence my previous question: do you know the speed of the PCI slot you are using? I've never seen this issue, but I only test with systems that have 133 MHz PCI-X slots. There may be some problem with slower (eg 100 MHz) PCI slots.

- R.
From mdidomenico at gmail.com Thu Feb 9 14:21:27 2006 From: mdidomenico at gmail.com (Michael Di Domenico) Date: Thu, 9 Feb 2006 17:21:27 -0500 Subject: [openib-general] Re: openib and mellanox hca problem In-Reply-To: References: <20060208172641.GC28594@mellanox.co.il> <97a7c7ed0602081011m690e30deiba9789f5ce946201@mail.gmail.com> <97a7c7ed0602090553j46e142c2mc9e22c9e2b3e15c@mail.gmail.com> <20060209182537.GH19594@esmail.cup.hp.com> <97a7c7ed0602091401r2b55ba96r999f3043c46b601c@mail.gmail.com> Message-ID: <97a7c7ed0602091421t416b5e01lff2b8982c4fb0d4c@mail.gmail.com> On 2/9/06, Roland Dreier wrote: > Michael> I'm still curious why the reset caused an issue and > Michael> whether it is thought that this is a random occurrence > Michael> because of the hardware I have? > > It is definitely something to do with the particular setup you have. > Hence my previous question: do you know the speed of the PCI slot you > are using? I've never seen this issue, but I only test with systems > that have 133 MHz PCI-X slots. There may be some problem with slower > (e.g. 100 MHz) PCI slots. They're 2.0 GHz Xeon machines that are older but not very old; there is a good chance they are 100 MHz, but I'll look into it more and get back to you... From suri at baymicrosystems.com Thu Feb 9 14:22:33 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 9 Feb 2006 17:22:33 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <1139505714.4450.966.camel@hal.voltaire.com> Message-ID: <200602092222.k19MMcqo020155@mail.baymicrosystems.com> Hal: I am only addressing the process_mad issue here: > > As a result, it appears like the process_mad function is always called > with a port number of > > zero. > > That should be OK.
The switch external port it was received on should be > in the ib_wc as follows: > > struct ib_wc { > u64 wr_id; > enum ib_wc_status status; > enum ib_wc_opcode opcode; > u32 vendor_err; > u32 byte_len; > __be32 imm_data; > u32 qp_num; > u32 src_qp; > int wc_flags; > u16 pkey_index; > u16 slid; > u8 sl; > u8 dlid_path_bits; > u8 port_num; /* valid only for DR SMPs > on switches */ > }; > > and that is the one that needs to be used in the DR return path. > [SS] If we are supposed to get the port num from the ib_wc, in process_mad() I can do that for all SMP methods such as getportinfo() etc... But the ib_wc parameter is NULL when process_mad is called from the show_pma_counter() function in sysfs.c. So should I be switching between the port_num parameter and ib_wc->port_num depending on whether ib_wc is NULL or not? Thanks, Suri From rdreier at cisco.com Thu Feb 9 14:29:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 14:29:07 -0800 Subject: [openib-general] port num in port priv In-Reply-To: <200602092222.k19MMcqo020155@mail.baymicrosystems.com> (Suresh Shelvapille's message of "Thu, 9 Feb 2006 17:22:33 -0500") References: <200602092222.k19MMcqo020155@mail.baymicrosystems.com> Message-ID: Suresh> [SS] If we are supposed to get the port num from the Suresh> ib_wc, in process_mad() I can do that for all SMP methods Suresh> such as getportinfo() etc... But the ib_wc parameter is Suresh> NULL when process_mad is called from the show_pma_counter() Suresh> function in sysfs.c. For performance management queries, you should not need to look at the physical port where the MAD was received. The port being queried is in the PortSelect field of the PortCounters query. Unfortunately the sysfs PMA counters support will not be that useful, since it will only show counters for port 0; but that's a different issue. - R.
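The port-selection rule discussed in this exchange can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not code from the OpenIB tree: struct sketch_wc, sketch_pick_port() and the SKETCH_* names are hypothetical stand-ins (the real struct ib_wc has many more fields), and the fallback to the caller's port number is one possible reading of the thread. The two management class values are the ones defined for the MAD header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct ib_wc; only the
 * field discussed in the thread is kept. */
struct sketch_wc {
	uint8_t port_num;	/* valid only for DR SMPs on switches */
};

/* Management class values from the MAD header. */
#define SKETCH_MGMT_CLASS_PERF_MGMT           0x04
#define SKETCH_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81

/*
 * Decide which port a MAD handler should act on (assumed policy):
 *  - DR SMP with a work completion: the port the SMP arrived on,
 *    taken from wc->port_num.
 *  - Performance management query: the PortSelect value carried in
 *    the attribute; the receiving port is irrelevant.
 *  - Anything else, including a NULL wc as in the sysfs path: the
 *    port number the caller passed in.
 */
uint8_t sketch_pick_port(uint8_t mgmt_class, const struct sketch_wc *wc,
			 uint8_t caller_port, uint8_t port_select)
{
	if (mgmt_class == SKETCH_MGMT_CLASS_SUBN_DIRECTED_ROUTE && wc != NULL)
		return wc->port_num;
	if (mgmt_class == SKETCH_MGMT_CLASS_PERF_MGMT)
		return port_select;
	return caller_port;
}
```

With this policy a handler would never dereference a NULL work completion, and a PMA query would always be answered for the port named in PortSelect.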
From halr at voltaire.com Thu Feb 9 14:43:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 17:43:23 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <200602092222.k19MMcqo020155@mail.baymicrosystems.com> References: <200602092222.k19MMcqo020155@mail.baymicrosystems.com> Message-ID: <1139525003.4450.2049.camel@hal.voltaire.com> Hi Suresh, On Thu, 2006-02-09 at 17:22, Suresh Shelvapille wrote: > Hal: > > I am only addressing the process_mad issue here: > > > > > > As a result, it appears like the process_mad function is always called > > with a port number of > > > zero. > > > > That should be OK. The switch external port it was received on should be > > in the ib_wc as follows: > > > > struct ib_wc { > > u64 wr_id; > > enum ib_wc_status status; > > enum ib_wc_opcode opcode; > > u32 vendor_err; > > u32 byte_len; > > __be32 imm_data; > > u32 qp_num; > > u32 src_qp; > > int wc_flags; > > u16 pkey_index; > > u16 slid; > > u8 sl; > > u8 dlid_path_bits; > > u8 port_num; /* valid only for DR SMPs > > on switches */ > > }; > > > > and that is the one that needs to be used in the DR return path. > > > [SS] if we are supposed to get the port num from the ib_wc, in > process_mad(0) I can do that for all SMP methods such as getportinfo() > etc... > But, the ib_wc parameter is null when process_mad is called from > show_pma_counter() function in sysfs.c? PMA uses LR (LID routed) rather than DR packets. LR packets do not fill in port number. Note that there are both LR and DR SMPs. port_num is only valid for DR SMPs. > So should I be switching between the port_num parameter vs. sib_wc->port_num > depending on whether ib_wc is NULL or not? I think the driver needs to fill in the port_num field in the ib_wc and this is only needed for DR SMPs (as the comment indicates). 
-- Hal From tzachid at mellanox.co.il Thu Feb 9 14:58:04 2006 From: tzachid at mellanox.co.il (Tzachi Dar) Date: Fri, 10 Feb 2006 00:58:04 +0200 Subject: [openib-general] RE: [Openib-windows] NFS performance and general disk network exportadvice (Linux-Windows) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD19E@mtlexch01.mtl.com> Here is some help from my side. I'm not so familiar with NFS but I'll try to help. IPoIB works well between Linux and windows, so running some NFS client should work well if both sides uses IPoIB. If, and this is a big if, both the NFS client on windows and the Linux sides are using (TCP) sockets, one can use SDP in order to communicate with high bandwidth. Please note that even if both are using sockets, there is a need to verify if the windows side is using the sockets from the user mode or from the kernel. Currently SDP can only work with sockets at the user level and if the client is actually a driver we should either write a connecting layer, or find another client. SDP on windows currently reaches ~1250MB of bandwidth, so I believe that you will get the BW you need. Thanks Tzachi > -----Original Message----- > From: openib-windows-bounces at openib.org > [mailto:openib-windows-bounces at openib.org] On Behalf Of Paul Baxter > Sent: Thursday, February 09, 2006 10:18 PM > To: openib-windows at openib.org > Cc: openib-general at openib.org > Subject: [Openib-windows] NFS performance and general disk > network exportadvice (Linux-Windows) > > I'm looking to export a filesystem from each of four linux > 64bit boxes to a single Windows server 2003 64bit Ed. > > Has anyone achieved this already using an IB transport? Can I > use NFS over IPoIB cross platform? i.e. do both ends support > a solution? > > Is NFS over RDMA compatible with Windows (pretty sure the > answer is no to this one but love to be proven wrong). I've > attached Tom's announcement of the latest to the bottom of > this email. 
I don't think Windows has the RDMA abstraction (yet)? > > Are windows IB drivers (Openib or Mellanox) compatible with > these options? > Do I layer Windows services for Unix on top of the Windows IB > drivers and IPoIB to achieve a cross platform NFS? > > Has anyone done much in the way of NFS performance > comparisons of NFS over IPoIB in cross-platform situations vs > say Gigabit ethernet. Does it work :) What is large file > throughput and processor loading - I'm aiming for 150-200 > MB/s on large files on 4x SDR IB (possibly DDR if we can fit > the bigger 144 port switch chassis into our rack layout for > 50-ish nodes). > > Are there any alternatives to using NFS that may be better > and that would 'transparently' receive a performance boost > with IB compared with using a simple NFS/gigabit ethernet > solution. Must be fairly straightforward, ideally application > neutral (configure a drive and load/unload script for Linux > and it just happens) and compatible between Win2003 and Linux? > Alternatives using perhaps Samba on the Linux side? > > My lack of knowledge of IB in the windows world has got me > concerned over whether this is actually achievable (easily). > > I hope to be trying this once we get a Windows 2003 machine, > but hope someone can encourage me that its a breeze prior to > my coming unstuck in a month or so! > > Some detail about the bit I do understand: > > I will be using a patched Linux kernel (realtime preemption > patches ) but prefer not to apply/track too many kernel > patches as the kernel evolves. The NFS patches suggested by > Tom in his announcement below make me a little nervous. > > The application will alternate between a real-time mode with > (probably) no NFS (or similar network exporting of the disk) > and an archiving mode where Linux will load relevant network > filesystem modules and let the windows machine read the disks. 
> > The reason for this odd load/unload behaviour is because our > current experience with NFS has been that the driver is prone > to putting multi-millisecond glitches that have a habit of > upsetting (soft) real-time behaviour at the sorts of timing > latencies we're looking at (milliseond or two). NFS (and > network cards) do like to batch up work and then run these > from interrupt contexts. SoftIRQs help tremendously but don't > seem to be the complete answer. > > Paul Baxter > > Tom's announcement: > > We have released an updated NFS/RDMA client for Linux at > the project's > > Sourceforge site: > > > > > > > > > > d=178973> > > > > This release updates the RPC/RDMA support as follows: > > Linux 2.6.15.2 supported > > Integrates with RPC via 2.6.15 transport switch Employs OpenIB RDMA > > verbs API (not kDAPL) Dual BSD/GPL2 licensing > > > > There are no protocol changes in this release, it is > identical to the > > previous release (and the IETF draft) in this respect. The > client has > > been tested with NFSv3 and passes the Connectathon test suite. > > > > At present, the client requires some additional transport switch > > patches to be applied to the Linux kernel, these are available at > > Chuck Lever's patches page: > > > > > > > The related CITI NFS/RDMA server project is currently available for > > 2.6.14 from: > > > > > > > > > > MA_stage2_2005-12-19.patch> > > > > This server is functional but only supports small RDMA inline data > > transfers, and a single request in flight. So, its performance is > > quite far from the potential. However, it is functional and is the > > server we pass Connectathon with! > > > > The server project is now being developed by Open Grid Computing, > > moving to the OpenIB common RDMA verbs API. We'll be making > updates to > > both client and server as they become available. There's a > lot more to > > do. 
> > > > We look forward to comments and feedback from the various standards > > and open source communities on this. Feel free to use the > mailing list > > on the sourceforge project site, or any of these lists (which we > > usually > > monitor) but cc at least me and James Lentini (jlentini at netapp.com). > > > > Thanks, > > Tom Talpey, for the various NFS/RDMA projects. > > _______________________________________________ > openib-windows mailing list > openib-windows at openib.org > http://openib.org/mailman/listinfo/openib-windows > From suri at baymicrosystems.com Thu Feb 9 14:56:26 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 9 Feb 2006 17:56:26 -0500 Subject: [openib-general] port num in port priv In-Reply-To: Message-ID: <200602092256.k19MuVsv020823@mail.baymicrosystems.com> > > Unfortunately the sysfs PMA counters support will not be that useful, > since it will only show counters for port 0; but that's a different issue. > > - R. [SS] That's a bummer, so don't we need to create as many file descriptors as the number of physical ports on a switch, so that we can gather different port related stats from the device locally (not talking about subnet manager queries)! Thanks, Suri From ardavis at ichips.intel.com Thu Feb 9 14:56:39 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 09 Feb 2006 14:56:39 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal In-Reply-To: <6.2.0.14.2.20060209122022.023baf20@esmail.cup.hp.com> References: <6.2.0.14.2.20060208134317.02654d00@esmail.cup.hp.com> <43EA8085.5070208@ichips.intel.com> <6.2.0.14.2.20060209122022.023baf20@esmail.cup.hp.com> Message-ID: <43EBC8A7.1080406@ichips.intel.com> Michael Krause wrote: > RDMA Write with Immediate is part of the IB Extended Transport > Header. It is a fixed-sized quantity and not one subject to change, > i.e. increasing its size. 
> > Your argument above reinforces that the particular application need is > IB-specific and thus should not be part of a general API but a > transport-specific API. If the application will only operate > optimally using immediate data, then it is only suitable for an IB > fabric. This reinforces the need for a transport-specific API. I agree. I will move the IB immediate data service back into the extension interface and update the OpenIB uDAPL provider patch. > > Those applications that simply want to enable completion notification > when a RDMA Write has occurred can use a general purpose API that is > interconnect independent and whose code is predicated upon a RDMA > Write - Send set of operations. This will enable application > portability across all interconnect types. I will defer this to Arkady to draft. -arlin From Arkady.Kanevsky at netapp.com Thu Feb 9 15:04:07 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 9 Feb 2006 18:04:07 -0500 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal Message-ID: Arlin, This can be done. But I have an issue: extension calls violate the Transport Requirements. Currently, the matching semantic is well-defined since Recv only matches Send. Since the Spec does not have any idea what operations are defined in extension(s), there is a problem with the transport requirements. We can, of course, make some generic statement that it does not cover APIs that are defined in extensions. The API requirements are easier to handle, since they have been written as "Nonrequirement" for the APIs we have not yet decided to define. (I will need to review chapter 5 to make sure we have followed this in all cases.) Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Thursday, February 09, 2006 5:57 PM > To: Michael Krause > Cc: dat-discussions at yahoogroups.com; > openib-general at openib.org; Kanevsky, Arkady > Subject: Re: [dat-discussions] [openib-general] [RFC] > DAT2.0immediatedataproposal > > Michael Krause wrote: > > > RDMA Write with Immediate is part of the IB Extended > Transport Header. > > It is a fixed-sized quantity and not one subject to change, i.e. > > increasing its size. > > > > Your argument above reinforces that the particular > application need is > > IB-specific and thus should not be part of a general API but a > > transport-specific API. If the application will only operate > > optimally using immediate data, then it is only suitable for an IB > > fabric. This reinforces the need for a transport-specific API. > > I agree. I will move the IB immediate data service back into > the extension interface and update the OpenIB uDAPL provider patch. > > > > > Those applications that simply want to enable completion > notification > > when a RDMA Write has occurred can use a general purpose > API that is > > interconnect independent and whose code is predicated upon a RDMA > > Write - Send set of operations. This will enable application > > portability across all interconnect types. > > I will defer this to Arkady to draft. > > -arlin > From suri at baymicrosystems.com Thu Feb 9 15:06:22 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Thu, 9 Feb 2006 18:06:22 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <1139505714.4450.966.camel@hal.voltaire.com> Message-ID: <200602092306.k19N6Rh6021023@mail.baymicrosystems.com> Hal and folks: > > For switches, the one from the WC needs to be filled in and passed so > that sounds wrong and needs fixing. Do you want to take a crack at this > or should I ? 
> [SS] I am not as IB core stack savvy as you guys are, so I would appreciate it if you can provide the changes...I will be happy to review and test it. Right now, I have kludged my sw driver to handle DR SMPs to make the subnet manager happy. Thanks a lot, Suri From halr at voltaire.com Thu Feb 9 15:06:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 18:06:29 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <200602092256.k19MuVsv020823@mail.baymicrosystems.com> References: <200602092256.k19MuVsv020823@mail.baymicrosystems.com> Message-ID: <1139526157.4450.2111.camel@hal.voltaire.com> On Thu, 2006-02-09 at 17:56, Suresh Shelvapille wrote: > > > > Unfortunately the sysfs PMA counters support will not be that useful, > > since it will only show counters for port 0; but that's a different issue. > > > > - R. > [SS] That's a bummer, so don't we need to create as many file descriptors as > the number of physical ports on a switch, so that we can gather different > port related stats from the device locally (not talking about subnet manager > queries)! The PMA counters are gathered by a tool named perfquery. Port number is in the PortSelect component of the PMA attribute. You can't get PMA counters until the port is active. -- Hal From halr at voltaire.com Thu Feb 9 15:12:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 18:12:58 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <200602092306.k19N6Rh6021023@mail.baymicrosystems.com> References: <200602092306.k19N6Rh6021023@mail.baymicrosystems.com> Message-ID: <1139526291.4450.2121.camel@hal.voltaire.com> Hi Suri, On Thu, 2006-02-09 at 18:06, Suresh Shelvapille wrote: > Hal and folks: > > > > > For switches, the one from the WC needs to be filled in and passed so > > that sounds wrong and needs fixing. Do you want to take a crack at this > > or should I ? 
> > > [SS] I am not as IB core stack savvy as you guys are, so I would appreciate > it if you can provide the changes...I will be happy to review and test it. I'll supply a patch but you will need to test it and provide feedback. OK ? -- Hal > Right now, I have kludged my sw driver to handle DR SMPs to make the subnet > manager happy. > > Thanks a lot, > Suri > From mst at mellanox.co.il Thu Feb 9 15:46:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 10 Feb 2006 01:46:54 +0200 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <43EB7E28.3050402@ichips.intel.com> References: <43EB7E28.3050402@ichips.intel.com> Message-ID: <20060209234654.GB5447@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: FW: [PATCH 1 of 3] mad: large RMPP support > > Roland Dreier wrote: > >My rule of thumb is that we shouldn't rely on being able to allocate a > >contiguous buffer bigger than 4 KB, but assuming we can allocate 4 KB > >is fine. 4 KB is the lowest page size of any real architecture, and > >if the kernel is out of free pages then any allocation is likely to > >fail. Allocations of larger buffers may fail because of memory > >fragmentation, even with plenty of free memory. > > > >That is: a 4 KB buffer is fine. > > Given this, I think that we'll need to go with the linked list then. Maybe > something like: > > struct ib_mad_segment { > struct list_head list; > u8 data[0]; > }; > > struct ib_mad_send_buf { > ... > void *mad; /* first segment */ > struct list_head rmpp_list; > u32 segment_size; > ... > }; Given that the last segment has a different size, it seems cleaner to just keep the segment size as part of the ib_mad_segment structure. > I'm undecided about whether all MADs should use the rmpp_list, with *mad > referencing the data of the first segment. This keeps the code consistent, > but would result in the first segment being larger (256-bytes) than > additional segments (say 220-bytes). I don't think it's a good idea.
Recall that when you send segments starting from the second one, you need the header from *mad. So this gets ugly very quickly. > Users could then walk the list of buffers without calling a routine that > needs to start at the beginning of the list every time. > > - Sean On the other hand, it makes sense to keep the single mad case as simple as possible. So that's a good reason to have the rmpp list include segments starting from the second one. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Thu Feb 9 16:02:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 16:02:51 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060209234654.GB5447@mellanox.co.il> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> Message-ID: <43EBD82B.3090804@ichips.intel.com> Michael S. Tsirkin wrote: > Given that the last segment has a different size, it seems cleaner > to just keep the segment size part of ib_mad_segment structure. The last segment should provide any necessary padding, so that the resulting MAD is 256-bytes. Segments 2 through n should be the same size. > I dont htink its a good idea. > Recall that when you send segments starting from the second one, > you need the header from *mad. So this gets ugly very quickly. This is true whether the first segment is in the rmpp_list or not. From a user's viewpoint, they can walk all segments using the list operations. Otherwise, they need to reference the first segment using *mad, then all other segments using a list. I do see the issue that the first segment requires an offset, whereas, others do not. > On the other hand, it makes sense to keep the single mad case > as simple as possible. So that's a good reason to have the rmpp list > include segments starting from the second one. For single segment MADs, the rmpp_list can be ignored by the user. 
It's just that the internal code can be easier. We won't have to special case tracking the last segment acked as being either referenced by *mad or a pointer to a segment. Hmm... okay, how about this idea? For single segment MADs, only *mad is used. For multiple segment MADs, *mad references the repeated header only. All data segments are in the rmpp_list. Does anything outside of userspace send a multi-segment RMPP MAD? Is it likely that a kernel component would need to? - Sean From mst at mellanox.co.il Thu Feb 9 16:54:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 10 Feb 2006 02:54:23 +0200 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <43EBD82B.3090804@ichips.intel.com> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> Message-ID: <20060210005423.GA7174@mellanox.co.il> Sean, at least I am a bit confused at this point. Let's go back and summarize the list of issues you see with the last patchset, OK? As far as I can see, we decided that the list of segments is the right approach. It also seems you are also inclining towards uniform handling of the first segment and the rest of them, so I hope this means the simplification achieved by always using an s/g list of size 2 is also accepted. It seems to me the only issue left is the extra list walks needed when we look up the segment by number. The simplest solution to that would probably be tracking the chunk addressed by seg_num, or something along these lines. Right? Some more comments below. Quoting Sean Hefty : > > Michael S. Tsirkin wrote: > >Given that the last segment has a different size, it seems cleaner > >to just keep the segment size part of ib_mad_segment structure. > > The last segment should provide any necessary padding, so that the > resulting MAD is 256-bytes. Segments 2 through n should be the same size. Uh, right. So segment size could just be a define. 
> >I dont htink its a good idea. > >Recall that when you send segments starting from the second one, > >you need the header from *mad. So this gets ugly very quickly. > > This is true whether the first segment is in the rmpp_list or not. From a > user's viewpoint, they can walk all segments using the list operations. > Otherwise, they need to reference the first segment using *mad, then all > other segments using a list. I do see the issue that the first segment > requires an offset, whereas, others do not. > > >On the other hand, it makes sense to keep the single mad case > >as simple as possible. So that's a good reason to have the rmpp list > >include segments starting from the second one. > > For single segment MADs, the rmpp_list can be ignored by the user. It's > just that the internal code can be easier. We won't have to special case > tracking the last segment acked as being either referenced by *mad or a > pointer to a segment. > > Hmm... okay, how about this idea? For single segment MADs, only *mad is > used. For multiple segment MADs, *mad references the repeated header only. > All data segments are in the rmpp_list. I dont really think it matters that much how we shuffle the buffers around. What matters to me is making the rmpp and mad code as simple as possible, and hiding all these details away from the user. > Does anything outside of userspace send a multi-segment RMPP MAD? Is it > likely that a kernel component would need to? > > - Sean > Sean, I think Jack's patch solves all these issues quite nicely: - There's an API function to get a MAD segment by index, so that users never have to know about how RMPP works - they get a pointer and size and can fill it in. - Anyone can fill these buffers: user_mad or another kernel component. - For sending data, there's a unified approach by using a s/g list of size 2 uniformly. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Thu Feb 9 17:11:04 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 17:11:04 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060210005423.GA7174@mellanox.co.il> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <20060210005423.GA7174@mellanox.co.il> Message-ID: <43EBE828.8080409@ichips.intel.com> Michael S. Tsirkin wrote: > As far as I can see, we decided that the list of segments is the right approach. Agreed. > It also seems you are also inclining towards uniform handling of the first > segment and the rest of them, so I hope this means the simplification > achieved by always using an s/g list of size 2 is also accepted. I would prefer to keep single segment MADs to a single SGE, but this isn't a big deal. For a list of segments, my preference is to match what was done for the receive side on completions. The first ib_mad_recv_buf is also in the rmpp_list. This made the code simpler when dealing with multiple receive buffers that comprised a single MAD. I'm anticipating that the same will be true for send handling. > It seems to me the only issue left is the extra list walks needed when we > look up the segment by number. The simplest solution to that would > probably be tracking the chunk addressed by seg_num, or something > along these lines. This is the primary issue. Both the internal code and user should be able to access segments without walking through the list multiple times. Internally, the RMPP code can do this using a couple of additional pointers. Externally, a user should have access to the segment list directly, rather than through a function call. The latest idea that I was suggesting was to separate the MAD header from the data segments for multi-segment/RMPP active MADs. 
Let *mad reference the header, and rmpp_list reference all data segments. Non-RMPP MADs would use *mad to reference the entire buffer. For an RMPP MAD (one with RMPP active) that fits into a single segment, I can go either way. - Sean From mshefty at ichips.intel.com Thu Feb 9 17:17:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Feb 2006 17:17:48 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060210005423.GA7174@mellanox.co.il> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <20060210005423.GA7174@mellanox.co.il> Message-ID: <43EBE9BC.9040003@ichips.intel.com> Michael S. Tsirkin wrote: >>The last segment should provide any necessary padding, so that the >>resulting MAD is 256-bytes. Segments 2 through n should be the same size. > > Uh, right. So segment size could just be a define. To be clear, the segment size still varies by class, but would be consistent for a given class. > I dont really think it matters that much how we shuffle the buffers around. > What matters to me is making the rmpp and mad code as simple as possible, > and hiding all these details away from the user. I agree. I was trying to suggest an implementation that I think will result in simpler code in the end. > - There's an API function to get a MAD segment by index, > so that users never have to know about how RMPP works - > they get a pointer and size and can fill it in. The problem is that it results in walking the list every time the user wants to see the next segment. This is an O(n^2) operation. In any case I can think of, all that's needed is a single walk through the list to copy the userspace MAD data into the appropriate data segments. In this case I think it makes sense to expose the list implementation to gain the extra performance. 
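The cost difference between by-index lookup and a single walk can be sketched as below. This is a simplified illustration, not the patch under review: it uses a plain singly linked list instead of the kernel's list_head, all sketch_* names are hypothetical, and SKETCH_SEG_SIZE simply reuses the 220-byte figure mentioned earlier in the thread as an example. The steps counter exists only to make the number of list hops visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SEG_SIZE 220	/* illustrative per-class data size */

/* Hypothetical data segment of a multi-segment send buffer. */
struct sketch_seg {
	struct sketch_seg *next;
	unsigned char data[SKETCH_SEG_SIZE];
};

/*
 * Lookup by index: every call restarts at the head, so fetching
 * segments 0..n-1 one after another costs 0 + 1 + ... + (n-1) hops,
 * which is O(n^2) overall.
 */
struct sketch_seg *sketch_seg_by_index(struct sketch_seg *head, size_t idx,
				       size_t *steps)
{
	struct sketch_seg *s = head;

	while (s != NULL && idx > 0) {
		s = s->next;
		idx--;
		(*steps)++;
	}
	return s;
}

/*
 * Single walk: copy a flat buffer into the segments in one pass,
 * zero-padding the tail of the last segment. Each segment is visited
 * once, so the cost is O(n). Returns the number of bytes copied.
 */
size_t sketch_fill_single_walk(struct sketch_seg *head,
			       const unsigned char *src, size_t len,
			       size_t *steps)
{
	struct sketch_seg *s;
	size_t copied = 0;

	for (s = head; s != NULL && copied < len; s = s->next) {
		size_t chunk = len - copied;

		if (chunk > SKETCH_SEG_SIZE)
			chunk = SKETCH_SEG_SIZE;
		memcpy(s->data, src + copied, chunk);
		memset(s->data + chunk, 0, SKETCH_SEG_SIZE - chunk);
		copied += chunk;
		(*steps)++;
	}
	return copied;
}
```

An accessor that remembers the last segment it returned would sit between these two extremes: the list stays hidden from the caller, and sequential access still costs one hop per segment.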
- Sean From rdreier at cisco.com Thu Feb 9 17:22:19 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 17:22:19 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <43EBE9BC.9040003@ichips.intel.com> (Sean Hefty's message of "Thu, 09 Feb 2006 17:17:48 -0800") References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <20060210005423.GA7174@mellanox.co.il> <43EBE9BC.9040003@ichips.intel.com> Message-ID: Sean> The problem is that it results in walking the list every Sean> time the user wants to see the next segment. This is an Sean> O(n^2) operation. In any case I can think of, all that's Sean> needed is a single walk through the list to copy the Sean> userspace MAD data into the appropriate data segments. In Sean> this case I think it makes sense to expose the list Sean> implementation to gain the extra performance. If we want to keep the internals hidden, then we could expose some sort of iterator object that clients can use to walk the list in O(n). - R. From sean.hefty at intel.com Thu Feb 9 17:25:18 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Thu, 9 Feb 2006 17:25:18 -0800 Subject: [openib-general] [PATCH v2] [RFC] - example user mode rdma ping/pongprogram using CMA Message-ID: Here's an updated version of rping. I restructured to code to make it more modular, reduce the size of some of the functions, simplify some areas, and make it more consistent. The updated version worked for my limited testing. Please review the changes to see if I changed any of the intended functionality. 
Signed-off-by: Sean Hefty --- Index: Makefile.am =================================================================== --- Makefile.am (revision 5098) +++ Makefile.am (working copy) @@ -18,9 +18,11 @@ endif src_librdmacm_la_SOURCES = src/cma.c src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) -bin_PROGRAMS = examples/ucmatose +bin_PROGRAMS = examples/ucmatose examples/rping examples_ucmatose_SOURCES = examples/cmatose.c examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la +examples_rping_SOURCES = examples/rping.c +examples_rping_LDADD = $(top_builddir)/src/librdmacm.la librdmacmincludedir = $(includedir)/rdma Index: examples/rping.c =================================================================== --- examples/rping.c (revision 0) +++ examples/rping.c (revision 0) @@ -0,0 +1,1051 @@ +/* + * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +static int debug = 0; +#define DEBUG_LOG if (debug) printf + +/* + * rping "ping/pong" loop: + * client sends source rkey/addr/len + * server receives source rkey/add/len + * server rdma reads "ping" data from source + * server sends "go ahead" on rdma read completion + * client sends sink rkey/addr/len + * server receives sink rkey/addr/len + * server rdma writes "pong" data to sink + * server sends "go ahead" on rdma write completion + * + */ + +/* + * These states are used to signal events between the completion handler + * and the main client or server thread. + * + * Once CONNECTED, they cycle through RDMA_READ_ADV, RDMA_WRITE_ADV, + * and RDMA_WRITE_COMPLETE for each ping. + */ +enum test_state { + IDLE = 1, + CONNECT_REQUEST, + ADDR_RESOLVED, + ROUTE_RESOLVED, + CONNECTED, + RDMA_READ_ADV, + RDMA_READ_COMPLETE, + RDMA_WRITE_ADV, + RDMA_WRITE_COMPLETE, + ERROR +}; + +struct rping_rdma_info { + uint64_t buf; + uint32_t rkey; + uint32_t size; +}; + +/* + * Default max buffer size for IO... + */ +#define RPING_BUFSIZE 64*1024 +#define RPING_SQ_DEPTH 16 + +/* + * Control block struct. 
+ */ +struct rping_cb { + int server; /* 0 iff client */ + pthread_t cqthread; + struct ibv_comp_channel *channel; + struct ibv_cq *cq; + struct ibv_pd *pd; + struct ibv_qp *qp; + + struct ibv_recv_wr rq_wr; /* recv work request record */ + struct ibv_sge recv_sgl; /* recv single SGE */ + struct rping_rdma_info recv_buf;/* malloc'd buffer */ + struct ibv_mr *recv_mr; /* MR associated with this buffer */ + + struct ibv_send_wr sq_wr; /* send work requrest record */ + struct ibv_sge send_sgl; + struct rping_rdma_info send_buf;/* single send buf */ + struct ibv_mr *send_mr; + + struct ibv_send_wr rdma_sq_wr; /* rdma work request record */ + struct ibv_sge rdma_sgl; /* rdma single SGE */ + char *rdma_buf; /* used as rdma sink */ + struct ibv_mr *rdma_mr; + + uint32_t remote_rkey; /* remote guys RKEY */ + uint64_t remote_addr; /* remote guys TO */ + uint32_t remote_len; /* remote guys LEN */ + + char *start_buf; /* rdma read src */ + struct ibv_mr *start_mr; + + enum test_state state; /* used for cond/signalling */ + sem_t sem; + + uint16_t port; /* dst port in NBO */ + uint32_t addr; /* dst addr in NBO */ + char *addr_str; /* dst addr string */ + int verbose; /* verbose logging */ + int count; /* ping count */ + int size; /* ping data size */ + int validate; /* validate ping data */ + + /* CM stuff */ + pthread_t cmthread; + struct rdma_cm_id *cm_id; /* connection on client side,*/ + /* listener on service side. */ + struct rdma_cm_id *child_cm_id; /* connection on server side */ +}; + +static void rping_cma_event_handler(struct rdma_cm_id *cma_id, + struct rdma_cm_event *event) +{ + int ret; + struct rping_cb *cb = cma_id->context; + + DEBUG_LOG("cma_event type %d cma_id %p (%s)\n", event->event, cma_id, + (cma_id == cb->cm_id) ? 
"parent" : "child"); + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + cb->state = ADDR_RESOLVED; + ret = rdma_resolve_route(cma_id, 2000); + if (ret) { + fprintf(stderr, "rdma_resolve_route error %d\n", ret); + sem_post(&cb->sem); + } + break; + + case RDMA_CM_EVENT_ROUTE_RESOLVED: + cb->state = ROUTE_RESOLVED; + sem_post(&cb->sem); + break; + + case RDMA_CM_EVENT_CONNECT_REQUEST: + cb->state = CONNECT_REQUEST; + cb->child_cm_id = cma_id; + DEBUG_LOG("child cma %p\n", cb->child_cm_id); + sem_post(&cb->sem); + break; + + case RDMA_CM_EVENT_ESTABLISHED: + DEBUG_LOG("ESTABLISHED\n"); + cb->state = CONNECTED; + sem_post(&cb->sem); + break; + + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + fprintf(stderr, "cma event %d, error %d\n", event->event, + event->status); + sem_post(&cb->sem); + break; + + case RDMA_CM_EVENT_DISCONNECTED: + fprintf(stderr, "DISCONNECT EVENT...\n"); + sem_post(&cb->sem); + break; + + case RDMA_CM_EVENT_DEVICE_REMOVAL: + fprintf(stderr, "cma detected device removal!!!!\n"); + break; + + default: + fprintf(stderr, "oof bad type!\n"); + sem_post(&cb->sem); + break; + } +} + +static int server_recv(struct rping_cb *cb, struct ibv_wc *wc) +{ + if (wc->byte_len != sizeof(cb->recv_buf)) { + fprintf(stderr, "Received bogus data, size %d\n", wc->byte_len); + return -1; + } + + cb->remote_rkey = cb->recv_buf.rkey; + cb->remote_addr = cb->recv_buf.buf; + cb->remote_len = cb->recv_buf.size; + DEBUG_LOG("Received rkey %x addr %llx len %d from peer\n", + cb->remote_rkey, cb->remote_addr, cb->remote_len); + + if (cb->state == CONNECTED || cb->state == RDMA_WRITE_COMPLETE) + cb->state = RDMA_READ_ADV; + else + cb->state = RDMA_WRITE_ADV; + + return 0; +} + +static int client_recv(struct rping_cb *cb, struct ibv_wc *wc) +{ + if (wc->byte_len != sizeof(cb->recv_buf)) { + fprintf(stderr, "Received bogus data, size %d\n", 
wc->byte_len); + return -1; + } + + if (cb->state == RDMA_READ_ADV) + cb->state = RDMA_WRITE_ADV; + else + cb->state = RDMA_WRITE_COMPLETE; + + return 0; +} + +static void rping_cq_event_handler(struct rping_cb *cb) +{ + struct ibv_wc wc; + struct ibv_recv_wr *bad_wr; + int ret; + + while ((ret = ibv_poll_cq(cb->cq, 1, &wc)) == 1) { + if (wc.status) { + fprintf(stderr, "cq completion failed status %d\n", + wc.status); + goto error; + } + + switch (wc.opcode) { + case IBV_WC_SEND: + DEBUG_LOG("send completion\n"); + break; + + case IBV_WC_RDMA_WRITE: + DEBUG_LOG("rdma write completion\n"); + cb->state = RDMA_WRITE_COMPLETE; + sem_post(&cb->sem); + break; + + case IBV_WC_RDMA_READ: + DEBUG_LOG("rdma read completion\n"); + cb->state = RDMA_READ_COMPLETE; + sem_post(&cb->sem); + break; + + case IBV_WC_RECV: + DEBUG_LOG("recv completion\n"); + ret = cb->server ? server_recv(cb, &wc) : + client_recv(cb, &wc); + if (ret) { + fprintf(stderr, "recv wc error: %d\n", ret); + goto error; + } + + ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post recv error: %d\n", ret); + goto error; + } + sem_post(&cb->sem); + break; + + default: + DEBUG_LOG("unknown!!!!! 
completion\n"); + goto error; + } + } + if (ret) { + fprintf(stderr, "poll error %d\n", ret); + goto error; + } + return; + +error: + cb->state = ERROR; + sem_post(&cb->sem); +} + +static int rping_accept(struct rping_cb *cb) +{ + struct rdma_conn_param conn_param; + int ret; + + DEBUG_LOG("accepting client connection request\n"); + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + + ret = rdma_accept(cb->child_cm_id, &conn_param); + if (ret) { + fprintf(stderr, "rdma_accept error: %d\n", ret); + return ret; + } + + sem_wait(&cb->sem); + if (cb->state != CONNECTED) { + fprintf(stderr, "wait for CONNECTED state %d\n", cb->state); + return -1; + } + return 0; +} + +static void rping_setup_wr(struct rping_cb *cb) +{ + cb->recv_sgl.addr = (uint64_t) (unsigned long) &cb->recv_buf; + cb->recv_sgl.length = sizeof cb->recv_buf; + cb->recv_sgl.lkey = cb->recv_mr->lkey; + cb->rq_wr.sg_list = &cb->recv_sgl; + cb->rq_wr.num_sge = 1; + + cb->send_sgl.addr = (uint64_t) (unsigned long) &cb->send_buf; + cb->send_sgl.length = sizeof cb->send_buf; + cb->send_sgl.lkey = cb->send_mr->lkey; + + cb->sq_wr.opcode = IBV_WR_SEND; + cb->sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->sq_wr.sg_list = &cb->send_sgl; + cb->sq_wr.num_sge = 1; + + cb->rdma_sgl.addr = (uint64_t) (unsigned long) cb->rdma_buf; + cb->rdma_sgl.lkey = cb->rdma_mr->lkey; + cb->rdma_sq_wr.send_flags = IBV_SEND_SIGNALED; + cb->rdma_sq_wr.sg_list = &cb->rdma_sgl; + cb->rdma_sq_wr.num_sge = 1; +} + +static int rping_setup_buffers(struct rping_cb *cb) +{ + int ret; + + DEBUG_LOG("rping_setup_buffers called on cb %p\n", cb); + + cb->recv_mr = ibv_reg_mr(cb->pd, &cb->recv_buf, sizeof cb->recv_buf, + IBV_ACCESS_LOCAL_WRITE); + if (!cb->recv_mr) { + fprintf(stderr, "recv_buf reg_mr failed\n"); + return errno; + } + + cb->send_mr = ibv_reg_mr(cb->pd, &cb->send_buf, sizeof cb->send_buf, 0); + if (!cb->send_mr) { + fprintf(stderr, "send_buf reg_mr failed\n"); + ret = 
errno; + goto err1; + } + + cb->rdma_buf = malloc(cb->size); + if (!cb->rdma_buf) { + fprintf(stderr, "rdma_buf malloc failed\n"); + ret = -ENOMEM; + goto err2; + } + + cb->rdma_mr = ibv_reg_mr(cb->pd, cb->rdma_buf, cb->size, + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_WRITE); + if (!cb->rdma_mr) { + fprintf(stderr, "rdma_buf reg_mr failed\n"); + ret = errno; + goto err3; + } + + if (!cb->server) { + cb->start_buf = malloc(cb->size); + if (!cb->start_buf) { + fprintf(stderr, "start_buf malloc failed\n"); + ret = -ENOMEM; + goto err4; + } + + cb->start_mr = ibv_reg_mr(cb->pd, cb->start_buf, cb->size, + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_WRITE); + if (!cb->start_mr) { + fprintf(stderr, "start_buf reg_mr failed\n"); + ret = errno; + goto err5; + } + } + + rping_setup_wr(cb); + DEBUG_LOG("allocated & registered buffers...\n"); + return 0; + +err5: + free(cb->start_buf); +err4: + ibv_dereg_mr(cb->rdma_mr); +err3: + free(cb->rdma_buf); +err2: + ibv_dereg_mr(cb->send_mr); +err1: + ibv_dereg_mr(cb->recv_mr); + return ret; +} + +static void rping_free_buffers(struct rping_cb *cb) +{ + DEBUG_LOG("rping_free_buffers called on cb %p\n", cb); + ibv_dereg_mr(cb->recv_mr); + ibv_dereg_mr(cb->send_mr); + ibv_dereg_mr(cb->rdma_mr); + free(cb->rdma_buf); + if (!cb->server) { + ibv_dereg_mr(cb->start_mr); + free(cb->start_buf); + } +} + +static int rping_create_qp(struct rping_cb *cb) +{ + struct ibv_qp_init_attr init_attr; + // struct ibv_qp_attr qp_attr; + int ret; + + memset(&init_attr, 0, sizeof(init_attr)); + init_attr.cap.max_send_wr = RPING_SQ_DEPTH; + init_attr.cap.max_recv_wr = 2; + init_attr.cap.max_recv_sge = 1; + init_attr.cap.max_send_sge = 1; + init_attr.qp_type = IBV_QPT_RC; + init_attr.send_cq = cb->cq; + init_attr.recv_cq = cb->cq; + + if (cb->server) { + ret = rdma_create_qp(cb->child_cm_id, cb->pd, &init_attr); + if (!ret) + cb->qp = cb->child_cm_id->qp; + } else { + ret = rdma_create_qp(cb->cm_id, 
cb->pd, &init_attr); + if (!ret) + cb->qp = cb->cm_id->qp; + } + } + +// if (ret) { +// cb->qp = NULL; +// return ret; +// } +// +// qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_READ| +// IBV_ACCESS_REMOTE_WRITE; +// ret = ibv_modify_qp(cb->qp, &qp_attr, IBV_QP_ACCESS_FLAGS); +// if (ret) +// printf("ibv_modify_qp returned %d\n", ret); + return ret; +} + +static void rping_free_qp(struct rping_cb *cb) +{ + ibv_destroy_qp(cb->qp); + ibv_destroy_cq(cb->cq); + ibv_destroy_comp_channel(cb->channel); + ibv_dealloc_pd(cb->pd); +} + +static int rping_setup_qp(struct rping_cb *cb, struct rdma_cm_id *cm_id) +{ + int ret; + + cb->pd = ibv_alloc_pd(cm_id->verbs); + if (!cb->pd) { + fprintf(stderr, "ibv_alloc_pd failed\n"); + return errno; + } + DEBUG_LOG("created pd %p\n", cb->pd); + + cb->channel = ibv_create_comp_channel(cm_id->verbs); + if (!cb->channel) { + fprintf(stderr, "ibv_create_comp_channel failed\n"); + ret = errno; + goto err1; + } + DEBUG_LOG("created channel %p\n", cb->channel); + + cb->cq = ibv_create_cq(cm_id->verbs, RPING_SQ_DEPTH * 2, cb, + cb->channel, 0); + if (!cb->cq) { + fprintf(stderr, "ibv_create_cq failed\n"); + ret = errno; + goto err2; + } + DEBUG_LOG("created cq %p\n", cb->cq); + + ret = ibv_req_notify_cq(cb->cq, 0); + if (ret) { + fprintf(stderr, "ibv_req_notify_cq failed\n"); + ret = errno; + goto err3; + } + + ret = rping_create_qp(cb); + if (ret) { + fprintf(stderr, "rping_create_qp failed: %d\n", ret); + goto err3; + } + DEBUG_LOG("created qp %p\n", cb->qp); + return 0; + +err3: + ibv_destroy_cq(cb->cq); +err2: + ibv_destroy_comp_channel(cb->channel); +err1: + ibv_dealloc_pd(cb->pd); + return ret; +} + +static void *cm_thread(void *arg) +{ + struct rdma_cm_event *event; + int ret; + + while (1) { + ret = rdma_get_cm_event(&event); + if (ret) { + fprintf(stderr, "rdma_get_cm_event err %d\n", ret); + exit(ret); + } + rping_cma_event_handler(event->id, event); + rdma_ack_cm_event(event); + } +} + +static void *cq_thread(void *arg) +{ + struct rping_cb
*cb = arg; + struct ibv_cq *ev_cq; + void *ev_ctx; + int ret; + + DEBUG_LOG("cq_thread started.\n"); + + while (1) { + ret = ibv_get_cq_event(cb->channel, &ev_cq, &ev_ctx); + if (ret) { + fprintf(stderr, "Failed to get cq event!\n"); + exit(ret); + } + if (ev_cq != cb->cq) { + fprintf(stderr, "Unknown CQ!\n"); + exit(-1); + } + ret = ibv_req_notify_cq(cb->cq, 0); + if (ret) { + fprintf(stderr, "Failed to set notify!\n"); + exit(ret); + } + rping_cq_event_handler(cb); + ibv_ack_cq_events(cb->cq, 1); + } +} + +static void rping_format_send(struct rping_cb *cb, char *buf, struct ibv_mr *mr) +{ + struct rping_rdma_info *info = &cb->send_buf; + + info->buf = (uint64_t) (unsigned long) buf; + info->rkey = mr->rkey; + info->size = cb->size; + + DEBUG_LOG("RDMA addr %llx rkey %x len %d\n", + info->buf, info->rkey, info->size); +} + +static void rping_test_server(struct rping_cb *cb) +{ + struct ibv_send_wr *bad_wr; + int ret; + + while (1) { + /* Wait for client's Start STAG/TO/Len */ + sem_wait(&cb->sem); + if (cb->state != RDMA_READ_ADV) { + fprintf(stderr, "wait for RDMA_READ_ADV state %d\n", + cb->state); + break; + } + + DEBUG_LOG("server received source adv\n"); + + /* Issue RDMA Read.
*/ + cb->rdma_sq_wr.opcode = IBV_WR_RDMA_READ; + cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; + cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; + cb->rdma_sq_wr.sg_list->length = cb->remote_len; + + ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + DEBUG_LOG("server posted rdma read req \n"); + + /* Wait for read completion */ + sem_wait(&cb->sem); + if (cb->state != RDMA_READ_COMPLETE) { + fprintf(stderr, "wait for RDMA_READ_COMPLETE state %d\n", + cb->state); + break; + } + DEBUG_LOG("server received read complete\n"); + + /* Display data in recv buf */ + if (cb->verbose) + printf("server ping data: %s\n", cb->rdma_buf); + + /* Tell client to continue */ + ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + DEBUG_LOG("server posted go ahead\n"); + + /* Wait for client's RDMA STAG/TO/Len */ + sem_wait(&cb->sem); + if (cb->state != RDMA_WRITE_ADV) { + fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", + cb->state); + break; + } + DEBUG_LOG("server received sink adv\n"); + + /* RDMA Write echo data */ + cb->rdma_sq_wr.opcode = IBV_WR_RDMA_WRITE; + cb->rdma_sq_wr.wr.rdma.rkey = cb->remote_rkey; + cb->rdma_sq_wr.wr.rdma.remote_addr = cb->remote_addr; + cb->rdma_sq_wr.sg_list->length = strlen(cb->rdma_buf) + 1; + DEBUG_LOG("rdma write from lkey %x laddr %llx len %d\n", + cb->rdma_sq_wr.sg_list->lkey, + cb->rdma_sq_wr.sg_list->addr, + cb->rdma_sq_wr.sg_list->length); + + ret = ibv_post_send(cb->qp, &cb->rdma_sq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + + /* Wait for completion */ + ret = sem_wait(&cb->sem); + if (cb->state != RDMA_WRITE_COMPLETE) { + fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", + cb->state); + break; + } + DEBUG_LOG("server rdma write complete \n"); + + /* Tell client to begin again */ + ret = ibv_post_send(cb->qp, &cb->sq_wr, 
&bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + DEBUG_LOG("server posted go ahead\n"); + } +} + +static int rping_bind_server(struct rping_cb *cb) +{ + struct sockaddr_in sin; + int ret; + + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = cb->addr; + sin.sin_port = cb->port; + + ret = rdma_bind_addr(cb->cm_id, (struct sockaddr *) &sin); + if (ret) { + fprintf(stderr, "rdma_bind_addr error %d\n", ret); + return ret; + } + DEBUG_LOG("rdma_bind_addr successful\n"); + + DEBUG_LOG("rdma_listen\n"); + ret = rdma_listen(cb->cm_id, 3); + if (ret) { + fprintf(stderr, "rdma_listen failed: %d\n", ret); + return ret; + } + + sem_wait(&cb->sem); + if (cb->state != CONNECT_REQUEST) { + fprintf(stderr, "wait for CONNECT_REQUEST state %d\n", + cb->state); + return -1; + } + + return 0; +} + +static void rping_run_server(struct rping_cb *cb) +{ + struct ibv_recv_wr *bad_wr; + int ret; + + ret = rping_bind_server(cb); + if (ret) + return; + + ret = rping_setup_qp(cb, cb->child_cm_id); + if (ret) { + fprintf(stderr, "setup_qp failed: %d\n", ret); + return; + } + + ret = rping_setup_buffers(cb); + if (ret) { + fprintf(stderr, "rping_setup_buffers failed: %d\n", ret); + goto err1; + } + + ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + goto err2; + } + + pthread_create(&cb->cqthread, NULL, cq_thread, cb); + + ret = rping_accept(cb); + if (ret) { + fprintf(stderr, "connect error %d\n", ret); + goto err2; + } + + rping_test_server(cb); + rdma_disconnect(cb->child_cm_id); + rdma_destroy_id(cb->child_cm_id); +err2: + rping_free_buffers(cb); +err1: + rping_free_qp(cb); +} + +static void rping_test_client(struct rping_cb *cb) +{ + int ping, start, cc, i, ret; + struct ibv_send_wr *bad_wr; + unsigned char c; + + start = 65; + for (ping = 0; !cb->count || ping < cb->count; ping++) { + cb->state = RDMA_READ_ADV; + + /* Put some ascii text in the buffer. 
*/ + cc = sprintf(cb->start_buf, "rdma-ping-%d: ", ping); + for (i = cc, c = start; i < cb->size; i++) { + cb->start_buf[i] = c; + c++; + if (c > 122) + c = 65; + } + start++; + if (start > 122) + start = 65; + cb->start_buf[cb->size - 1] = 0; + + rping_format_send(cb, cb->start_buf, cb->start_mr); + ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + + /* Wait for server to ACK */ + sem_wait(&cb->sem); + if (cb->state != RDMA_WRITE_ADV) { + fprintf(stderr, "wait for RDMA_WRITE_ADV state %d\n", + cb->state); + break; + } + + rping_format_send(cb, cb->rdma_buf, cb->rdma_mr); + ret = ibv_post_send(cb->qp, &cb->sq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "post send error %d\n", ret); + break; + } + + /* Wait for the server to say the RDMA Write is complete. */ + sem_wait(&cb->sem); + if (cb->state != RDMA_WRITE_COMPLETE) { + fprintf(stderr, "wait for RDMA_WRITE_COMPLETE state %d\n", + cb->state); + break; + } + + if (cb->validate) + if (memcmp(cb->start_buf, cb->rdma_buf, cb->size)) { + fprintf(stderr, "data mismatch!\n"); + break; + } + + if (cb->verbose) + printf("ping data: %s\n", cb->rdma_buf); + } +} + +static int rping_connect_client(struct rping_cb *cb) +{ + struct rdma_conn_param conn_param; + int ret; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + conn_param.retry_count = 10; + + ret = rdma_connect(cb->cm_id, &conn_param); + if (ret) { + fprintf(stderr, "rdma_connect error %d\n", ret); + return ret; + } + + sem_wait(&cb->sem); + if (cb->state != CONNECTED) { + fprintf(stderr, "wait for CONNECTED state %d\n", cb->state); + return -1; + } + + DEBUG_LOG("rmda_connect successful\n"); + return 0; +} + +static int rping_bind_client(struct rping_cb *cb) +{ + struct sockaddr_in sin; + int ret; + + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = cb->addr; + sin.sin_port = cb->port; + + 
ret = rdma_resolve_addr(cb->cm_id, NULL, (struct sockaddr *) &sin, + 2000); + if (ret) { + fprintf(stderr, "rdma_resolve_addr error %d\n", ret); + return ret; + } + + sem_wait(&cb->sem); + if (cb->state != ROUTE_RESOLVED) { + fprintf(stderr, "waiting for addr/route resolution state %d\n", + cb->state); + return -1; + } + + DEBUG_LOG("rdma_resolve_addr - rdma_resolve_route successful\n"); + return 0; +} + +static void rping_run_client(struct rping_cb *cb) +{ + struct ibv_recv_wr *bad_wr; + int ret; + + ret = rping_bind_client(cb); + if (ret) + return; + + ret = rping_setup_qp(cb, cb->cm_id); + if (ret) { + fprintf(stderr, "setup_qp failed: %d\n", ret); + return; + } + + ret = rping_setup_buffers(cb); + if (ret) { + fprintf(stderr, "rping_setup_buffers failed: %d\n", ret); + goto err1; + } + + ret = ibv_post_recv(cb->qp, &cb->rq_wr, &bad_wr); + if (ret) { + fprintf(stderr, "ibv_post_recv failed: %d\n", ret); + goto err2; + } + + pthread_create(&cb->cqthread, NULL, cq_thread, cb); + + ret = rping_connect_client(cb); + if (ret) { + fprintf(stderr, "connect error %d\n", ret); + goto err2; + } + + rping_test_client(cb); + rdma_disconnect(cb->cm_id); +err2: + rping_free_buffers(cb); +err1: + rping_free_qp(cb); +} + +static void usage(char *name) +{ + printf("%s -c|s [-vVd] [-S size] [-C count] -a addr -p port\n", + basename(name)); + printf("\t-c\t\tclient side\n"); + printf("\t-s\t\tserver side\n"); + printf("\t-v\t\tdisplay ping data to stdout\n"); + printf("\t-V\t\tvalidate ping data\n"); + printf("\t-d\t\tdebug printfs\n"); + printf("\t-S size \tping data size\n"); + printf("\t-C count\tping count times\n"); + printf("\t-a addr\t\taddress\n"); + printf("\t-p port\t\tport\n"); +} + +int main(int argc, char *argv[]) +{ + struct rping_cb *cb; + int op; + int ret = 0; + + cb = malloc(sizeof(*cb)); + if (!cb) + return -ENOMEM; + + memset(cb, 0, sizeof(*cb)); + cb->server = -1; + cb->state = IDLE; + cb->size = 64; + sem_init(&cb->sem, 0, 0); + + opterr = 0; + while
((op=getopt(argc, argv, "a:p:C:S:t:scvVd")) != -1) { + switch (op) { + case 'a': + cb->addr_str = optarg; + cb->addr = inet_addr(optarg); + DEBUG_LOG("ipaddr (%s)\n", optarg); + break; + case 'p': + cb->port = htons(atoi(optarg)); + DEBUG_LOG("port %d\n", (int) atoi(optarg)); + break; + case 's': + cb->server = 1; + DEBUG_LOG("server\n"); + break; + case 'c': + cb->server = 0; + DEBUG_LOG("client\n"); + break; + case 'S': + cb->size = atoi(optarg); + if ((cb->size < 1) || + (cb->size > (RPING_BUFSIZE - 1))) { + fprintf(stderr, "Invalid size %d " + "(valid range is 1 to %d)\n", + cb->size, RPING_BUFSIZE); + ret = EINVAL; + } else + DEBUG_LOG("size %d\n", (int) atoi(optarg)); + break; + case 'C': + cb->count = atoi(optarg); + if (cb->count < 0) { + fprintf(stderr, "Invalid count %d\n", + cb->count); + ret = EINVAL; + } else + DEBUG_LOG("count %d\n", (int) cb->count); + break; + case 'v': + cb->verbose++; + DEBUG_LOG("verbose\n"); + break; + case 'V': + cb->validate++; + DEBUG_LOG("validate data\n"); + break; + case 'd': + debug++; + break; + default: + usage("rping"); + ret = EINVAL; + break; + } + } + if (ret) + goto out; + + if (cb->server == -1) { + usage("rping"); + ret = EINVAL; + goto out; + } + + ret = rdma_create_id(&cb->cm_id, cb); + if (ret) { + ret = errno; + fprintf(stderr, "rdma_create_id error %d\n", ret); + goto out; + } + DEBUG_LOG("created cm_id %p\n", cb->cm_id); + + pthread_create(&cb->cmthread, NULL, cm_thread, cb); + + if (cb->server) + rping_run_server(cb); + else + rping_run_client(cb); + + DEBUG_LOG("destroy cm_id %p\n", cb->cm_id); + rdma_destroy_id(cb->cm_id); +out: + free(cb); + return ret; +} From mst at mellanox.co.il Thu Feb 9 17:29:43 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 10 Feb 2006 03:29:43 +0200 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <43EBE828.8080409@ichips.intel.com> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <20060210005423.GA7174@mellanox.co.il> <43EBE828.8080409@ichips.intel.com> Message-ID: <20060210012943.GC7174@mellanox.co.il> Quoting r. Sean Hefty : > Externally, a user should have access to the segment list > directly, rather than through a function call. Why not a function call? > The latest idea that I was suggesting was to separate the MAD header from > the data segments for multi-segment/RMPP active MADs. Let *mad reference > the header, and rmpp_list reference all data segments. Non-RMPP MADs would > use *mad to reference the entire buffer. For an RMPP MAD (one with RMPP > active) that fits into a single segment, I can go either way. What problem would this change address? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 9 17:31:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 10 Feb 2006 03:31:30 +0200 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <20060210005423.GA7174@mellanox.co.il> <43EBE9BC.9040003@ichips.intel.com> Message-ID: <20060210013130.GD7174@mellanox.co.il> Quoting r. Roland Dreier : > If we want to keep the internals hidden, then we could expose some > sort of iterator object that clients can use to walk the list in O(n). AFAIK that's what Jack is working on. Expect results by Monday. -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From swise at opengridcomputing.com Thu Feb 9 18:32:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Feb 2006 20:32:08 -0600 Subject: [openib-general] Re: [PATCH v2] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: References: Message-ID: <1139538728.14152.14.camel@stevo-laptop> On Thu, 2006-02-09 at 17:25 -0800, Hefty, Sean wrote: > Here's an updated version of rping. I restructured the code to make it > more modular, reduce the size of some of the functions, simplify some > areas, and make it more consistent. The updated version worked for my > limited testing. Please review the changes to see if I changed any of > the intended functionality. I like what you did on making it more modular! I don't see any real functional changes, but you removed the simple private data exchange. Any particular reason? Also: > +static int rping_create_qp(struct rping_cb *cb) > +{ > + struct ibv_qp_init_attr init_attr; > + // struct ibv_qp_attr qp_attr; > + int ret; > + > + memset(&init_attr, 0, sizeof(init_attr)); > + init_attr.cap.max_send_wr = RPING_SQ_DEPTH; > + init_attr.cap.max_recv_wr = 2; > + init_attr.cap.max_recv_sge = 1; > + init_attr.cap.max_send_sge = 1; > + init_attr.qp_type = IBV_QPT_RC; > + init_attr.send_cq = cb->cq; > + init_attr.recv_cq = cb->cq; > + > + if (cb->server) { > + ret = rdma_create_qp(cb->child_cm_id, cb->pd, > &init_attr); > + if (!ret) > + cb->qp = cb->child_cm_id->qp; > + } else { > + ret = rdma_create_qp(cb->cm_id, cb->pd, &init_attr); > + if (!ret) > + cb->qp = cb->cm_id->qp; > + } > + > +// if (ret) { > +// cb->qp = NULL; > +// return ret; > +// } > +// > +// qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_READ| > +// IBV_ACCESS_REMOTE_WRITE; > +// ret = ibv_modify_qp(cb->qp, &qp_attr, IBV_QP_ACCESS_FLAGS); > +// if (ret) > +// printf("ibv_modify_qp returned %d\n", ret); > + return ret; > +} I added this qp_modify code during testing because the server's initial
rdma read request wasn't being processed. I later found out that this was due to the server path not reaping the "CONNECTED" event prior to posting the rdma read wr. Now I notice you commented out this code (but left it in). So what's up with these access flags on qp? You cannot set them on qp creation...only on qp modify. That seems strange. Since rdma read/writes work without these attributes set, I'm wondering what they really do? Steve. From rdreier at cisco.com Thu Feb 9 18:42:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Feb 2006 18:42:34 -0800 Subject: [openib-general] Re: [PATCH v2] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: <1139538728.14152.14.camel@stevo-laptop> (Steve Wise's message of "Thu, 09 Feb 2006 20:32:08 -0600") References: <1139538728.14152.14.camel@stevo-laptop> Message-ID: Steve> So what's up with these access flags on qp? You cannot set Steve> them on qp creation...only on qp modify. That seems Steve> strange. Since rdma read/writes work without these Steve> attributes set, I'm wondering what they really do? I'm not sure how things interact with the RDMA CM abstraction, but the underlying IB modify QP operation allows the QP access flags to be set in the RESET -> INIT state transition. A QP cannot generate or process any messages until it is in the INIT state (actually, until it is in the RTR state, which can only be reached through the INIT state), so there's no problem setting them early enough. RDMA operations should not work without the QP access flags set and I'm pretty sure that I've seen RDMAs fail without them in practice, so there's still a bit of a mystery with rping. - R. 
From swise at opengridcomputing.com Thu Feb 9 19:32:20 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Feb 2006 21:32:20 -0600 Subject: [openib-general] Re: [PATCH v2] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: References: <1139538728.14152.14.camel@stevo-laptop> Message-ID: <1139542340.16814.1.camel@stevo-laptop> On Thu, 2006-02-09 at 18:42 -0800, Roland Dreier wrote: > Steve> So what's up with these access flags on qp? You cannot set > Steve> them on qp creation...only on qp modify. That seems > Steve> strange. Since rdma read/writes work without these > Steve> attributes set, I'm wondering what they really do? > > I'm not sure how things interact with the RDMA CM abstraction, but the > underlying IB modify QP operation allows the QP access flags to be set > in the RESET -> INIT state transition. A QP cannot generate or > process any messages until it is in the INIT state (actually, until it > is in the RTR state, which can only be reached through the INIT > state), so there's no problem setting them early enough. > > RDMA operations should not work without the QP access flags set and > I'm pretty sure that I've seen RDMAs fail without them in practice, so > there's still a bit of a mystery with rping. > > - R. After poking around, I think CMA sets these via ib_cm_init_qp_attr() called by rdma_init_qp_attr(). Stevo. From tom at opengridcomputing.com Thu Feb 9 20:16:06 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 09 Feb 2006 22:16:06 -0600 Subject: [openib-general] RE: [Openib-windows] NFS performance and general disk network export advice (Linux-Windows) In-Reply-To: <000b01c62dc4$8781aba0$6701a8c0@infiniconsys.com> References: <000b01c62dc4$8781aba0$6701a8c0@infiniconsys.com> Message-ID: <1139544966.8779.7.camel@strider.opengridcomputing.com> Fab: As you point out, we've been focused on the main trunk and our target test platforms are Linux based.
That said, we actually had Beta versions of NDIS and Winsock Direct drivers for the AMSO adapter, so we know this works and we know where the dead are buried. It probably makes sense to wait until the core iWARP support is merged into the main trunk. however, when/if you decide to merge up from the main trunk and pick up iWARP support, I am more than happy to help you with any issues that you may have. Tom On Thu, 2006-02-09 at 14:02 -0800, Fab Tillier wrote: > Hi Paul, > > > I'm looking to export a filesystem from each of four linux > > 64bit boxes to a single Windows server 2003 64bit Ed. > > > > Has anyone achieved this already using an IB transport? Can > > I use NFS over IPoIB cross platform? i.e. do both ends > > support a solution? > > IPoIB will interoperate cross platform, so any higher-level services you layer > above TCP/IP or UDP/IP should work fine. > > > Is NFS over RDMA compatible with Windows (pretty sure the > > answer is no to this one but love to be proven wrong). I've > > attached Tom's announcement of the latest to the bottom of > > this email. I don't think Windows has the RDMA abstraction > > (yet)? > > There is no NFS over RDMA file system for OpenIB Windows. It would be great to > have it, but the focus is currently on getting the core stack stable and > released. > > The long term goals, at least from my perspective, is to match functionality > between OpenIB Linux and Windows, even if the APIs aren't identical. The > reality is that the iWARP crowd hasn't really been involved in the Windows > project, and have not driven any requirements, so that stack is continuing to be > focused on IB only. I don't have a timeline for getting functionality matched > up, and we could certainly use more hands on deck for the Windows project. > > > Are windows IB drivers (Openib or Mellanox) compatible with > > these options? > > Do I layer Windows services for Unix on top of the Windows IB > > drivers and IPoIB to achieve a cross platform NFS? 
> > I don't know what you would need to do to get NFS working on Windows, but that > should be an orthogonal problem to getting IB working. If NFS works on Windows > over GbE, it should work without a problem over IPoIB. > > > Has anyone done much in the way of NFS performance > > comparisons of NFS over IPoIB in cross-platform situations > > vs say Gigabit ethernet. Does it work :) What is large file > > throughput and processor loading - I'm aiming for 150-200 > > MB/s on large files on 4x SDR IB (possibly DDR if we can > > fit the bigger 144 port switch chassis into our rack layout > > for 50-ish nodes). > > I can tell you that IPoIB performance on Windows is pretty awful. The reason > for that is that the IPoIB driver shoehorns itself into the NDIS stack as a > 802.3 Ethernet NIC, and thus gets 6-byte Ethernet MAC addresses. Further, > Windows doesn't have any IB knowledge, so the IPoIB driver is responsible for > all ARP and DHCP encapsulation to match the IPoIB protocol on the wire. This > involves snooping both outbound and inbound packets to see if they need > conversion, which does nasty stuff to performance. > > Depending on the host CPU, 150-200MB/s should be achievable (I've seen 150+MB/s > in some of my testing). > > > Are there any alternatives to using NFS that may be better > > and that would 'transparently' receive a performance boost > > with IB compared with using a simple NFS/gigabit ethernet > > solution. Must be fairly straightforward, ideally application > > neutral (configure a drive and load/unload script for Linux > > and it just happens) and compatible between Win2003 and > > Linux? Alternatives using perhaps Samba on the Linux side? > > If you only have a single Windows box that has to read data from one or more > Linux boxes, you might have some success with making the Linux boxes SRP > targets, and then using the Windows SRP driver to access the Linux boxes. 
The > SRP target driver would have to handle SRP commands and perform local disk > access. > > Of course, the file system would have to be Windows compatible with this > solution, but you should be able to get the full RDMA performance since there > would be no network stack involved. You'd also need to make sure that only a > single system accesses the data on the disks exported as SRP targets to prevent > corruption as those disks would appear as locally attached drives to the Windows > box. > > I am unaware of an SRP target implementation for Linux, though, so that may not > be a viable option for you. > > > My lack of knowledge of IB in the windows world has got me > > concerned over whether this is actually achievable (easily). > > > > I hope to be trying this once we get a Windows 2003 machine, > > but hope someone can encourage me that its a breeze prior to > > my coming unstuck in a month or so! > > The IB stuff should be a breeze to get functional and interoperating. Whether > performance matches your requirements/expectations is another thing. Do report > back if you have any questions or run into any problems along the way. 
> > - Fab > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Feb 9 20:06:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Feb 2006 23:06:17 -0500 Subject: [openib-general] relocation error / link time reference error In-Reply-To: <200602070246.k172kb325484@qube3.dbresearch.net> References: <200602070246.k172kb325484@qube3.dbresearch.net> Message-ID: <1139544376.4450.3026.camel@hal.voltaire.com> On Mon, 2006-02-06 at 21:46, Sean Hubbell wrote: > ---------- Original message ---------- > Date: 05 Feb 2006 11:44:52 -0500 > From: Hal Rosenstock > Reply-To: Hal Rosenstock > To: Sean Hubbell > Subject: Re: [openib-general] relocation error / link time reference error > > On Sun, 2006-02-05 at 09:40, Sean Hubbell wrote: > > Hal, > > > > I removed and rebuilt everything. > > And everything's OK now ? > > -- Hal > > > Nope, I still have the link time reference problem. I'll download the latest svn tree again in the morning and rebuild. How do you typically download and rebuild? Here are the steps that I follow: > 1) Download the openib code. > 2) Copy a version of the Kernel Source Tree and copy over the infiniband directory to the drivers dir. > 3) Removed the include/rdma directory and all of the .svn directories > 4) Get a second version of the Kernel Source Tree and build a patch file for the infiniband changes. > 5) I add the patch file to the linux-2.6.15.spec file > 6) I rebuild the kernel (rpm based kernel) and then install the rpms (smp, numa, ...). > 7) I reboot. > 8) I then remove all of the openib modules > 9) I then rebuilt openib tools from the commands listed on the wiki FAQ. > 10) That's it ... That's the kernel side; not the userspace side. 
The problem is on the userspace side, specifically the management libraries and diags. > How do you rebuilt openib? I build the userspace management similar to what the "Building management tools" page under "Cheat Sheet" on the OpenIB wiki describes. > Do you pull from a particular tag or the trunk? I do an svn update from the trunk mostly. -- Hal > If anyone has a better way to build the kernel, please let me know. > I only want to make sure that I can built it as an rpm because I like > the ability to figure out what file goes with what package. Any and all > suggestions would be appreciated. > > Thanks again for all of the help Hal. > > Sean Hubbell > From sean.hefty at intel.com Thu Feb 9 20:25:26 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Feb 2006 20:25:26 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060210013130.GD7174@mellanox.co.il> Message-ID: >> If we want to keep the internals hidden, then we could expose some >> sort of iterator object that clients can use to walk the list in O(n). > >AFAIK thats what Jack is working on. Expect results by Monday. Why not just use the list functionality that ships with the kernel? This is what the received RMPP MADs do. - Sean From sean.hefty at intel.com Thu Feb 9 20:50:32 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Feb 2006 20:50:32 -0800 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <20060210012943.GC7174@mellanox.co.il> Message-ID: >> The latest idea that I was suggesting was to separate the MAD header from >> the data segments for multi-segment/RMPP active MADs. Let *mad reference >> the header, and rmpp_list reference all data segments. Non-RMPP MADs would >> use *mad to reference the entire buffer. For an RMPP MAD (one with RMPP >> active) that fits into a single segment, I can go either way. > >What problem would this change address?

From the original patch, this addresses walking through the list multiple times. A user needs to be able to walk through all data segments in order to copy the data. It is easier for me to do:

For each segment
    Copy the segment

Versus:

Copy the first segment
For each segment > 1
    Copy the segment

The RMPP code would also be able to treat all segments equally. It can use last_ack and last_sent pointers to track its location. On a timeout, it simply resends starting at last_ack, versus checking to see if the timeout occurred on the first segment or some later segment. Likewise an ACK would simply move the last_ack pointer forward, versus needing the check.

My assumption is that this would result in a simpler implementation. From within the kernel, it seems like it would. I don't know if there's an easier or more efficient approach given the user_mad implementation, hence my question regarding whether kernel modules would need to send multi-segment RMPP MADs. If not, it's entirely possible that we may find that what works well for handling MADs coming down from userspace may not be the most efficient implementation from RMPP's viewpoint.

In any case, I think it's important to see multiple implementation possibilities, so that others can be considered if the implementation of one starts getting ugly.

- Sean

From karun at gs-lab.com Thu Feb 9 20:56:12 2006 From: karun at gs-lab.com (Karun Beer Sharma) Date: Fri, 10 Feb 2006 10:26:12 +0530 Subject: [openib-general] 2.6.9-22 kernel patch for iSER Message-ID: <43EC1CEC.8030808@gs-lab.com> Hi: I want to compile openIB with 2.6.9-22EL kernel version. It is giving me error with respect to iSER module. After browsing through openib.org website, I came to know that iSER is supported from 2.6.11 onwards but some work is going on for the patch with 2.6.9-22 kernel version. Just want to know if the patch is available. If yes, from where can i download it. Thanks in advance.
Regards, Karun From sean.hefty at intel.com Thu Feb 9 21:02:30 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Feb 2006 21:02:30 -0800 Subject: [openib-general] RE: [PATCH v2] [RFC] - example user mode rdma ping/pongprogramusing CMA In-Reply-To: <1139538728.14152.14.camel@stevo-laptop> Message-ID: >I like what you did on making it more modular! I don't see any real >functional changes, but you removed the simple private data exchange. >Any particular reason? I didn't see that the private data was being checked. I dropped it from one of the function calls when I was merging some of the calls together. It's probably not that hard to add back in. >> +static int rping_create_qp(struct rping_cb *cb) >> +{ >> + struct ibv_qp_init_attr init_attr; >> + // struct ibv_qp_attr qp_attr; >> + int ret; >> + >> + memset(&init_attr, 0, sizeof(init_attr)); >> + init_attr.cap.max_send_wr = RPING_SQ_DEPTH; >> + init_attr.cap.max_recv_wr = 2; >> + init_attr.cap.max_recv_sge = 1; >> + init_attr.cap.max_send_sge = 1; >> + init_attr.qp_type = IBV_QPT_RC; >> + init_attr.send_cq = cb->cq; >> + init_attr.recv_cq = cb->cq; >> + >> + if (cb->server) { >> + ret = rdma_create_qp(cb->child_cm_id, cb->pd, >> &init_attr); >> + if (!ret) >> + cb->qp = cb->child_cm_id->qp; >> + } else { >> + ret = rdma_create_qp(cb->cm_id, cb->pd, &init_attr); >> + if (!ret) >> + cb->qp = cb->cm_id->qp; >> + } >> + >> +// if (ret) { >> +// cb->qp = NULL; >> +// return ret; >> +// } >> +// >> +// qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_READ| >> +// IBV_ACCESS_REMOTE_WRITE; >> +// ret = ibv_modify_qp(cb->qp, &qp_attr, IBV_QP_ACCESS_FLAGS); >> +// if (ret) >> +// printf("ibv_modify_qp returned %d\n", ret); >> + return ret; >> +} > >I added this qp_modify code during testing because the server's initial >rdma read request wasn't being processed. I later found out that this >was due to the server path not reaping the "CONNECTED" event prior to >posting the rdma read wr. 
Now I notice you commented out this code (but >left it in). Uhm... I didn't mean to leave the code like this before submitting the patch. I knew that the IB CM set the QP attributes for RDMA read based on the requested responder_resources being > 0, and writes were enabled by default. I commented out the code to verify that the modify wasn't needed. We should always be able to determine if reads should be enabled. Enabling writes by default is debatable, but that's the way the code is currently written. - Sean From grave at ipno.in2p3.fr Fri Feb 10 00:35:34 2006 From: grave at ipno.in2p3.fr (Xavier Grave) Date: Fri, 10 Feb 2006 09:35:34 +0100 Subject: [openib-general] libsdp running nearly fine In-Reply-To: <1139504959.673.31.camel@brick.internal.keyresearch.com> References: <1139502670.14833.14.camel@ipnnarval> <1139504959.673.31.camel@brick.internal.keyresearch.com> Message-ID: <1139560534.19362.4.camel@ipnnarval> On Thursday 09 February 2006 at 09:09 -0800, Ralph Campbell wrote: > My guess is there is a bug in the zero-copy code. > Try "echo 1000000 > /sys/module/ib_sdp/parameters/sdp_zcopy_thrsh_src_default" > and see if the problem still exists. > This raises the zero-copy threshold. I think you are right, everything runs well now. I'll try to evaluate performance between two blades.
thanks, xavier From grave at
ipno.in2p3.fr Fri Feb 10 02:59:09 2006
From: grave at ipno.in2p3.fr (Xavier Grave)
Date: Fri, 10 Feb 2006 11:59:09 +0100
Subject: [openib-general] libsdp without zcopy also have problems
Message-ID: <1139569149.19364.27.camel@ipnnarval>

When it runs I get very good results, about 228 MB/s; with the same sample program without sdp, over ipoib, I get about 130 MB/s.

But if I try to connect to a non-running server I get this:

iommu_free: invalid entry
entry = 0x0
dma_addr = 0x0
Table = 0xc0000000edf34480
bus# = 0xd9
size = 0x10000
startOff = 0x20000
index = 0x3000003
Badness in __iommu_free at arch/powerpc/kernel/iommu.c:208
Call Trace:
[C0000000EBF831C0] [C00000000000E6A0] .show_stack+0x6c/0x1a0 (unreliable)
[C0000000EBF83260] [C0000000000206A4] .program_check_exception+0x168/0x518
[C0000000EBF83300] [C000000000004348] program_check_common+0xc8/0x100
--- Exception: 700 at .__iommu_free+0xe0/0x178
    LR = .__iommu_free+0xd8/0x178
[C0000000EBF83690] [C00000000002259C] .iommu_free+0x50/0xb4
[C0000000EBF83730] [C00000000002738C] .pci_iommu_unmap_single+0x40/0x54
[C0000000EBF837A0] [C000000000021E68] .dma_unmap_single+0x54/0x70
[C0000000EBF83810] [C00000000024A31C] .sdp_buff_q_clear_unmap+0x40/0xf4
[C0000000EBF838B0] [C0000000002518F8] .sdp_conn_put+0x154/0x228
[C0000000EBF83940] [C00000000024CFE0] .sdp_inet_release+0x36c/0x398
[C0000000EBF83A10] [C00000000025A214] .sock_release+0x54/0x148
[C0000000EBF83AA0] [C00000000025AA50] .sock_close+0x44/0x60
[C0000000EBF83B20] [C0000000000B106C] .__fput+0xf8/0x254
[C0000000EBF83BC0] [C0000000000AE320] .filp_close+0xac/0xd4
[C0000000EBF83C50] [C000000000051FB4] .put_files_struct+0xbc/0x148
[C0000000EBF83CF0] [C0000000000538C4] .do_exit+0x23c/0x980
[C0000000EBF83DA0] [C0000000000540E4] .sys_exit_group+0x0/0x8
[C0000000EBF83E30] [C0000000000086F8] syscall_exit+0x0/0x40

I have time to test this if somebody wants me to do so.
xavier From halr at voltaire.com Fri Feb 10 04:17:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 07:17:25 -0500 Subject: [openib-general] Re: [PATCH] Opensm - change default dir for Windows In-Reply-To: <5zek2f1sjq.fsf@mtl066.yok.mtl.com> References: <5zek2f1sjq.fsf@mtl066.yok.mtl.com> Message-ID: <1139573843.4450.4977.camel@hal.voltaire.com> On Tue, 2006-02-07 at 03:28, Yael Kalka wrote: > Hi Hal, > > The following patch includes some fixes for the windows stack: > 1. Add needed __cdecl. > 2. Change the default directories/files names. Thanks. Applied.
-- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka From halr at voltaire.com Fri Feb 10 04:33:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 07:33:22 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osm_mcast_mgr.c add type casting In-Reply-To: <5z7j86l0bl.fsf@mtl066.yok.mtl.com> References: <5z7j86l0bl.fsf@mtl066.yok.mtl.com> Message-ID: <1139574800.4450.5054.camel@hal.voltaire.com> On Wed, 2006-02-08 at 03:30, Yael Kalka wrote: > Hi Hal, > > The following patch adds a missing type casting in the return value of > the function osm_mcast_mgr_compute_max_hops. Thanks. Applied. -- Hal > Thanks, > Yael From halr at voltaire.com Fri Feb 10 04:53:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 07:53:53 -0500 Subject: [openib-general] Re: [PATCH] Opensm - type changing in st.h/c files In-Reply-To: <5z64nqkzec.fsf@mtl066.yok.mtl.com> References: <5z64nqkzec.fsf@mtl066.yok.mtl.com> Message-ID: <1139576032.4450.5180.camel@hal.voltaire.com> On Wed, 2006-02-08 at 03:50, Yael Kalka wrote: > Hi Hal, > > There was a problem with some of the types defined when compiling on > 64bit windows machines. The following patch adds support for these as > well. Thanks. Applied. -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka
From suri at baymicrosystems.com Fri Feb 10 06:09:25 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 10 Feb 2006 09:09:25 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <1139526157.4450.2111.camel@hal.voltaire.com> Message-ID: <200602101409.k1AE9UP7003658@mail.baymicrosystems.com> Hal: The way I gather port stats locally on my device is by doing a "cat" on the "/sys/class/Infiniband/mysw/port/0/counters" file. As Roland mentioned this is limited to port0 counters on a switch. If I need stats for port#N I need to expand this by providing more class attributes and getting to them that way.... I am not using user space verbs library, is there any other way to get at the different port#N stats? Thanks, Suri > The PMA counters are gathered by a tool named perfquery. Port number is > in the PortSelect component of the PMA attribute. You can't get PMA > counters until the port is active.
> > -- Hal From halr at voltaire.com Fri Feb 10 06:34:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 09:34:25 -0500 Subject: [openib-general] svn Ids on OpenIB code ? Message-ID: <1139582065.4450.5781.camel@hal.voltaire.com> Hi, In most all the OpenIB code, we have been putting in svn IDs. In the kernel submission of the ipath code, the following comment came back from lkml from Christoph. Should we remove all the svn IDs from the OpenIB kernel tree ? -- Hal -----Forwarded Message----- From: Christoph Hellwig To: Roland Dreier Cc: linux-kernel at vger.kernel.org, openib-general at openib.org Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers Date: 17 Dec 2005 13:14:56 +0000 > + * $Id: ipath_common.h 4491 2005-12-15 22:20:31Z rjwalsh $ please remove RCSIDs everywhere. From chas at cmf.nrl.navy.mil Fri Feb 10 06:57:34 2006 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Fri, 10 Feb 2006 09:57:34 -0500 Subject: [openib-general] questions about gen2 srp driver In-Reply-To: Message-ID: <200602101457.k1AEvYRU018661@cmf.nrl.navy.mil> In message ,Roland Dreier writes: >Yes, it's exactly because we know that work queues run in process >context with interrupts enabled which lets us use spin_lock_irq. thanks for the reply. you are quite right. i dont know what i was thinking. >There's no limitation on number of outstanding RDMAs targeting a >single R_Key. after looking at it further i finally see what the srp driver is doing. i didnt know that the rkey/lkey it gets during init applies to entire host memory. now, things make a little more sense. part of my confusion is that ->va really seems to mean physical address not virtual address. 
From mdidomenico at gmail.com Fri Feb 10 07:11:16 2006
From: mdidomenico at gmail.com (Michael Di Domenico)
Date: Fri, 10 Feb 2006 10:11:16 -0500
Subject: [openib-general] libtool error under libibmad
Message-ID: <97a7c7ed0602100711r7d798df3i9bd1bfe77365d559@mail.gmail.com>

I'm trying to compile libibmad, but instead of installing to /usr/local I'm installing to /opt/openib. libibcommon and libibumad compiled just fine with the same configure syntax:

(cd libibmad && ./autogen.sh && ./configure --prefix=/opt/openib LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include && make && make install)

/bin/sh ./libtool --mode=link --tag=CC gcc -I/opt/openib/include -L/opt/openib/lib -o libibmad.la -rpath /opt/openib/lib -version-info 1 -export-dynamic -Wl,--version-script=./src/libibmad.map libibmad_la-dump.lo libibmad_la-fields.lo libibmad_la-mad.lo libibmad_la-portid.lo libibmad_la-resolve.lo libibmad_la-rpc.lo libibmad_la-sa.lo libibmad_la-smp.lo libibmad_la-gs.lo libibmad_la-serv.lo libibmad_la-register.lo libibmad_la-vendor.lo -libumad -libcommon
grep: /usr/local/lib/libibcommon.la: No such file or directory
/bin/sed: can't read /usr/local/lib/libibcommon.la: No such file or directory
libtool: link: `/usr/local/lib/libibcommon.la' is not a valid libtool archive
make[2]: *** [libibmad.la] Error 1
make[2]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad'
make: *** [all] Error 2

I worked around the problem by temporarily linking /usr/local/lib to /opt/openib/lib, and it seemed to go through just fine... I'm not sure if this is a hardcoded path somewhere or if this particular library is picking up something strange that the others did not.

[root at linux14 libibmad]# ./autogen.sh
+ aclocal -I config
+ libtoolize --force --copy
Putting files in AC_CONFIG_AUX_DIR, `config'.
+ autoheader + automake --foreign --add-missing --copy + autoconf [root at linux14 libibmad]# ./configure --prefix=/opt/openib LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... i686-redhat-linux-gnu checking host system type... i686-redhat-linux-gnu checking for style of include used by make... GNU checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking dependency style of gcc... gcc3 checking for a sed that does not truncate output... /bin/sed checking for egrep... grep -E checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for /usr/bin/ld option to reload object files... -r checking for BSD-compatible nm... /usr/bin/nm -B checking whether ln -s works... yes checking how to recognise dependent libraries... pass_all checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking dlfcn.h usability... yes checking dlfcn.h presence... yes checking for dlfcn.h... yes checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... 
gcc3 checking how to run the C++ preprocessor... g++ -E checking for g77... g77 checking whether we are using the GNU Fortran 77 compiler... yes checking whether g77 accepts -g... yes checking the maximum length of command line arguments... 32768 checking command to parse /usr/bin/nm -B output from gcc object... ok checking for objdir... .libs checking for ar... ar checking for ranlib... ranlib checking for strip... strip checking if gcc static flag works... yes checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC checking if gcc PIC flag -fPIC works... yes checking if gcc supports -c -o file.o... yes checking whether the gcc linker (/usr/bin/ld) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes configure: creating libtool appending configuration tag "CXX" to libtool checking for ld used by g++... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC checking if g++ PIC flag -fPIC works... yes checking if g++ supports -c -o file.o... yes checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes appending configuration tag "F77" to libtool checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... 
yes checking for g77 option to produce PIC... -fPIC checking if g77 PIC flag -fPIC works... yes checking if g77 supports -c -o file.o... yes checking whether the g77 linker (/usr/bin/ld) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking for gcc... (cached) gcc checking whether we are using the GNU C compiler... (cached) yes checking whether gcc accepts -g... (cached) yes checking for gcc option to accept ANSI C... (cached) none needed checking dependency style of gcc... (cached) gcc3 checking for sys_read_string in -libcommon... yes checking for umad_init in -libumad... yes checking for ANSI C header files... (cached) yes checking netinet/in.h usability... yes checking netinet/in.h presence... yes checking for netinet/in.h... yes checking for stdlib.h... (cached) yes checking for string.h... (cached) yes checking sys/time.h usability... yes checking sys/time.h presence... yes checking for sys/time.h... yes checking for unistd.h... (cached) yes checking infiniband/common.h usability... yes checking infiniband/common.h presence... no configure: WARNING: infiniband/common.h: accepted by the compiler, rejected by the preprocessor! configure: WARNING: infiniband/common.h: proceeding with the compiler's result checking for infiniband/common.h... yes checking infiniband/umad.h usability... yes checking infiniband/umad.h presence... no configure: WARNING: infiniband/umad.h: accepted by the compiler, rejected by the preprocessor! configure: WARNING: infiniband/umad.h: proceeding with the compiler's result checking for infiniband/umad.h... yes checking for memset... yes checking for strrchr... yes checking for strtol... yes checking for an ANSI C-conforming const... yes checking for inline... inline checking whether ld accepts --version-script... 
yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating libibmad.spec
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands

From tom at opengridcomputing.com Fri Feb 10 07:36:00 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Fri, 10 Feb 2006 09:36:00 -0600
Subject: [openib-general] [PATCH] NOARP Device Support in the CMA
Message-ID: <1139585760.18470.13.camel@trinity.ogc.int>

This patch adds support for devices that do ARP internally. The patch checks the flags in the netdevice during ARP resolution processing to see if it supports ARP. If it does not, it will return 'done' up the stack until it hits the CMA where a callback will be generated on the CMA workqueue thread.

I tested this with the 'krping' application on mthca. This can't be tested with a device that actually does this (AMSO) until we get this driver dropped in the trunk. It has been tested in the iWARP branch however with both IB and iWARP.
Signed-off-by: Tom Tucker

Index: addr.c
===================================================================
--- addr.c (revision 5356)
+++ addr.c (working copy)
@@ -36,6 +36,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -163,6 +164,12 @@
 	if (ret)
 		goto out;
 
+	/* If the device does ARP internally, return 'done' */
+	if (rt->idev->dev->flags & IFF_NOARP) {
+		copy_addr(addr, rt->idev->dev, NULL);
+		return 1;
+	}
+
 	neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev);
 	if (!neigh) {
 		ret = -ENODATA;
@@ -288,6 +295,10 @@
 		req->timeout = jiffies;
 		queue_req(req);
 		break;
+	case 1:
+		ret = req->status;
+		kfree(req);
+		break;
 	case -ENODATA:
 		req->timeout = msecs_to_jiffies(timeout_ms) + jiffies;
 		queue_req(req);
Index: cma.c
===================================================================
--- cma.c (revision 5356)
+++ cma.c (working copy)
@@ -1315,6 +1315,27 @@
 	ret = rdma_resolve_ip(src_addr, dst_addr, &id->route.addr.dev_addr,
 			      timeout_ms, addr_handler, id_priv);
+	if (ret == 1) {
+		/* Already resolved, schedule the addr_handler on the
+		 * work queue.
+		 */
+		struct cma_work *work = kmalloc(sizeof *work, GFP_KERNEL);
+		if (!work)
+			return -ENOMEM;
+		ret = cma_acquire_dev(id_priv);
+		if (ret) {
+			kfree(work);
+			goto err;
+		}
+		work->id = id_priv;
+		INIT_WORK(&work->work, cma_work_handler, work);
+		work->old_state = CMA_ADDR_QUERY;
+		work->new_state = CMA_ADDR_RESOLVED;
+		work->event.event = RDMA_CM_EVENT_ADDR_RESOLVED;
+		queue_work(rdma_wq, &work->work);
+		return 0;
+	}
+
 	if (ret)
 		goto err;

From eitan at mellanox.co.il Fri Feb 10 07:39:23 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 10 Feb 2006 17:39:23 +0200
Subject: [openib-general] port num in port priv
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B724@mtlexch01.mtl.com>

Hi Suri,

You can send a PMA PortCounters MAD either using "process local" or to the device using its LID.
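
As a hedged illustration of what such a request might look like on the wire, the sketch below fills a raw 256-byte MAD buffer for a PerfMgt Get(PortCounters). The function name build_port_counters_get() is hypothetical, and the byte offsets and constants reflect my reading of the InfiniBand MAD layout (common header fields, class 0x04, attribute 0x0012, PortSelect in the second data byte); please check them against the specification before relying on them. The transaction ID is left at zero and would need to be set by the sender.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAD_SIZE            256
#define MGMT_CLASS_PERF     0x04    /* PerfMgt management class */
#define METHOD_GET          0x01
#define ATTR_PORT_COUNTERS  0x0012
#define PERF_DATA_OFFSET    64      /* assumed start of the attribute data */

/* Fill in a PerfMgt Get(PortCounters) request for the given port number. */
static void build_port_counters_get(uint8_t mad[MAD_SIZE], uint8_t port_select)
{
	memset(mad, 0, MAD_SIZE);
	mad[0] = 1;                               /* BaseVersion */
	mad[1] = MGMT_CLASS_PERF;                 /* MgmtClass */
	mad[2] = 1;                               /* ClassVersion */
	mad[3] = METHOD_GET;                      /* Method */
	mad[16] = ATTR_PORT_COUNTERS >> 8;        /* AttributeID, big-endian */
	mad[17] = ATTR_PORT_COUNTERS & 0xff;
	mad[PERF_DATA_OFFSET + 1] = port_select;  /* PortSelect */
}
```

The same buffer could then be handed either to the local MAD processing path or to a send toward the device's LID; that part is deliberately left out of the sketch.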
Eitan

> I am not using user space verbs library, is there any other way to get at
> the different port#N stats?
>
> Thanks,
> Suri

From halr at voltaire.com Fri Feb 10 07:44:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Feb 2006 10:44:51 -0500
Subject: [openib-general] Re: [PATCH] management/*/autogen.sh: config dir test and creation
In-Reply-To: <20060209133127.GA29512@sashak.voltaire.com>
References: <20060209133127.GA29512@sashak.voltaire.com>
Message-ID: <1139586284.4450.6002.camel@hal.voltaire.com>

On Thu, 2006-02-09 at 08:31, Sasha Khapyorsky wrote:
> Hi,
>
> This will check config dir existence and create it if necessary. As
> previous one but for libraries and diags.

Thanks. Applied.

> Sasha.
>
>
> This will check config dir existence and create it if necessary.
>
> Signed-off-by: Sasha Khapyorsky

From halr at voltaire.com Fri Feb 10 07:57:35 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Feb 2006 10:57:35 -0500
Subject: [openib-general] port num in port priv
In-Reply-To: <200602101409.k1AE9UP7003658@mail.baymicrosystems.com>
References: <200602101409.k1AE9UP7003658@mail.baymicrosystems.com>
Message-ID: <1139587050.4450.6060.camel@hal.voltaire.com>

Hi Suri,

On Fri, 2006-02-10 at 09:09, Suresh Shelvapille wrote:
> Hal:
> The way I gather port stats locally on my device is by doing a "cat" on the
> "/sys/class/Infiniband/mysw/port/0/counters" file. As Roland mentioned this
> is limited to port0 counters on a switch.

Base switch port 0 does not support PortCounters; only enhanced switch port 0.

> If I need stats for port#N I need to expand this by providing more class
> attributes and getting to them that way....

Switches are a little different than HCAs in terms of this so you may want to change the model.

> I am not using user space verbs library, is there any other way to get at
> the different port#N stats?
Yes, by sending the appropriate PerfMgt Get for PortCounters to the local LID (of port 0) with the PortSelect component set to the switch external port you want and decoding the GetResp. That's what the perfquery tool does. Your switch needs to support PMA for this to work. -- Hal > Thanks, > Suri > > > > The PMA counters are gathered by a tool named perfquery. Port number is > > in the PortSelect component of the PMA attribute. You can't get PMA > > counters until the port is active. > > > > -- Hal > From halr at voltaire.com Fri Feb 10 08:05:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 11:05:34 -0500 Subject: [openib-general] IPoIB and lid change In-Reply-To: <20060208201404.GE32759@mellanox.co.il> References: <20060208201404.GE32759@mellanox.co.il> Message-ID: <1139587533.4450.6094.camel@hal.voltaire.com> On Wed, 2006-02-08 at 15:14, Michael S. Tsirkin wrote: > Hi, Roland! > One issue we have with IPoIB is that IPoIB may cache a remote node path for a > long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB might > lose connectivity. The remote LID may get changed for other reasons too without an SM change (SM merge of 2 separate subnets). How can this be handled ? -- Hal > One simple way to address this would be to have a list of all > address handles per net device and kill them on an SM change event. > > What do you think?
From halr at voltaire.com Fri Feb 10 08:59:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 11:59:59 -0500 Subject: [openib-general] Re: [PATCH] Opensm - client reregistration In-Reply-To: <5z4q38lol4.fsf@mtl066.yok.mtl.com> References: <5z4q38lol4.fsf@mtl066.yok.mtl.com> Message-ID: <1139590798.4475.13.camel@hal.voltaire.com> On Thu, 2006-02-09 at 07:10, Yael Kalka wrote: > Hi Hal, > > Currently, the OpenSM sends PortInfo with ClientReRegistration bit > turned on only during the first sweep after becoming Master. > This doesn't cover all cases where ClientReRegistration should be > turned on. OpenSM should turn on this bit also on new ports it > discovers (in cases of subnet merging, for example). > The following patch adds support for turning on ClientReRegistration > bit on newly discovered ports. Thanks. Applied (with change below; see comment). -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 5337) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -897,6 +897,7 @@ __osm_lid_mgr_get_port_lid( > static boolean_t > __osm_lid_mgr_set_physp_pi( > IN osm_lid_mgr_t * const p_mgr, > + IN osm_port_t* const p_port, > IN osm_physp_t* const p_physp, > IN ib_net16_t const lid ) > { > @@ -910,6 +911,7 @@ __osm_lid_mgr_set_physp_pi( > uint8_t op_vls; > uint8_t port_num; > boolean_t send_set = FALSE; > + boolean_t new_port = FALSE; > > OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_set_physp_pi ); > > @@ -1140,8 +1142,17 @@ __osm_lid_mgr_set_physp_pi( > /* > We need to set the cli_rereg bit when we are in first_time_master_sweep for > ports supporting the ClientReregistration Vol1 (v1.2) p811 14.4.11 > + Also, if this port was just now discovered - then we should also set the > + cli_rereg bit.
We know that the port was just discovered if it is in > + the p_subn->new_port_list list. > */ > - if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE ) && > + if ( cl_is_object_in_list(&p_mgr->p_subn->new_ports_list, p_port) ) missing { here... > + /* p_port is in new_ports_list - mark new_port as TRUE */ > + new_port = TRUE; > + } > + > + if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE || > + new_port == TRUE ) && > ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) ) > ib_port_info_set_client_rereg( p_pi, 1 ); > else > @@ -1243,7 +1254,7 @@ __osm_lid_mgr_process_our_sm_node( > */ > p_physp = osm_port_get_default_phys_ptr( p_port ); > > - __osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ) ); > + __osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ) ); > > Exit: > OSM_LOG_EXIT( p_mgr->p_log ); > @@ -1367,7 +1378,7 @@ osm_lid_mgr_process_subnet( > > p_physp = osm_port_get_default_phys_ptr( p_port ); > /* the proc returns the fact it sent a set port info */ > - if (__osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ))) > + if (__osm_lid_mgr_set_physp_pi( p_mgr, p_port, p_physp, cl_hton16( min_lid_ho ))) > p_mgr->send_set_reqs = TRUE; > } > } /* all ports */ > From rdreier at cisco.com Fri Feb 10 09:17:47 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 09:17:47 -0800 Subject: [openib-general] svn Ids on OpenIB code ? In-Reply-To: <1139582065.4450.5781.camel@hal.voltaire.com> (Hal Rosenstock's message of "10 Feb 2006 09:34:25 -0500") References: <1139582065.4450.5781.camel@hal.voltaire.com> Message-ID: Hal> Should we remove all the svn IDs from the OpenIB kernel tree? It's probably worth leaving them in our svn tree. If you want to make a patch that removes them from the upstream kernel, that would be fine. - R. 
From halr at voltaire.com Fri Feb 10 09:11:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 12:11:03 -0500 Subject: [openib-general] libtool error under libibmad In-Reply-To: <97a7c7ed0602100711r7d798df3i9bd1bfe77365d559@mail.gmail.com> References: <97a7c7ed0602100711r7d798df3i9bd1bfe77365d559@mail.gmail.com> Message-ID: <1139591462.4475.29.camel@hal.voltaire.com> Hi Michael, On Fri, 2006-02-10 at 10:11, Michael Di Domenico wrote: > trying to compile libibmad, but not installing to /usr/local i'm > installing to /opt/openib.
libibcommon and libibumad compiled just > fine with the same configure syntax > > (cd libibmad && ./autogen.sh && ./configure --prefix=/opt/openib > LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include && make && make > install) > > /bin/sh ./libtool --mode=link --tag=CC gcc -I/opt/openib/include > -L/opt/openib/lib -o libibmad.la -rpath /opt/openib/lib -version-info > 1 -export-dynamic -Wl,--version-script=./src/libibmad.map > libibmad_la-dump.lo libibmad_la-fields.lo libibmad_la-mad.lo > libibmad_la-portid.lo libibmad_la-resolve.lo libibmad_la-rpc.lo > libibmad_la-sa.lo libibmad_la-smp.lo libibmad_la-gs.lo > libibmad_la-serv.lo libibmad_la-register.lo libibmad_la-vendor.lo > -libumad -libcommon > grep: /usr/local/lib/libibcommon.la: No such file or directory > /bin/sed: can't read /usr/local/lib/libibcommon.la: No such file or directory > libtool: link: `/usr/local/lib/libibcommon.la' is not a valid libtool archive > make[2]: *** [libibmad.la] Error 1 > make[2]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > make: *** [all] Error 2 > > I worked around the problem by linking /usr/local/lib to > /opt/openib/lib temp. and it seemed to go through just fine... I'm > not sure if this is a hardcode somewhere or if this particuliar > library is picking up something strange that the others didnot It looks like libibcommon is not installed in /opt/openib/lib for some reason. What is is that directory ? Also, what is in /opt/openib/include ? Did you do make install the other libraries and in what order (and did those work) ? -- Hal > [root at linux14 libibmad]# ./autogen.sh > + aclocal -I config > + libtoolize --force --copy > Putting files in AC_CONFIG_AUX_DIR, `config'. 
> + autoheader > + automake --foreign --add-missing --copy > + autoconf > [root at linux14 libibmad]# ./configure --prefix=/opt/openib > LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include > checking for a BSD-compatible install... /usr/bin/install -c > checking whether build environment is sane... yes > checking for gawk... gawk > checking whether make sets $(MAKE)... yes > checking build system type... i686-redhat-linux-gnu > checking host system type... i686-redhat-linux-gnu > checking for style of include used by make... GNU > checking for gcc... gcc > checking for C compiler default output file name... a.out > checking whether the C compiler works... yes > checking whether we are cross compiling... no > checking for suffix of executables... > checking for suffix of object files... o > checking whether we are using the GNU C compiler... yes > checking whether gcc accepts -g... yes > checking for gcc option to accept ANSI C... none needed > checking dependency style of gcc... gcc3 > checking for a sed that does not truncate output... /bin/sed > checking for egrep... grep -E > checking for ld used by gcc... /usr/bin/ld > checking if the linker (/usr/bin/ld) is GNU ld... yes > checking for /usr/bin/ld option to reload object files... -r > checking for BSD-compatible nm... /usr/bin/nm -B > checking whether ln -s works... yes > checking how to recognise dependent libraries... pass_all > checking how to run the C preprocessor... gcc -E > checking for ANSI C header files... yes > checking for sys/types.h... yes > checking for sys/stat.h... yes > checking for stdlib.h... yes > checking for string.h... yes > checking for memory.h... yes > checking for strings.h... yes > checking for inttypes.h... yes > checking for stdint.h... yes > checking for unistd.h... yes > checking dlfcn.h usability... yes > checking dlfcn.h presence... yes > checking for dlfcn.h... yes > checking for g++... g++ > checking whether we are using the GNU C++ compiler... 
yes > checking whether g++ accepts -g... yes > checking dependency style of g++... gcc3 > checking how to run the C++ preprocessor... g++ -E > checking for g77... g77 > checking whether we are using the GNU Fortran 77 compiler... yes > checking whether g77 accepts -g... yes > checking the maximum length of command line arguments... 32768 > checking command to parse /usr/bin/nm -B output from gcc object... ok > checking for objdir... .libs > checking for ar... ar > checking for ranlib... ranlib > checking for strip... strip > checking if gcc static flag works... yes > checking if gcc supports -fno-rtti -fno-exceptions... no > checking for gcc option to produce PIC... -fPIC > checking if gcc PIC flag -fPIC works... yes > checking if gcc supports -c -o file.o... yes > checking whether the gcc linker (/usr/bin/ld) supports shared libraries... yes > checking whether -lc should be explicitly linked in... no > checking dynamic linker characteristics... GNU/Linux ld.so > checking how to hardcode library paths into programs... immediate > checking whether stripping libraries is possible... yes > checking if libtool supports shared libraries... yes > checking whether to build shared libraries... yes > checking whether to build static libraries... yes > configure: creating libtool > appending configuration tag "CXX" to libtool > checking for ld used by g++... /usr/bin/ld > checking if the linker (/usr/bin/ld) is GNU ld... yes > checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes > checking for g++ option to produce PIC... -fPIC > checking if g++ PIC flag -fPIC works... yes > checking if g++ supports -c -o file.o... yes > checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes > checking dynamic linker characteristics... GNU/Linux ld.so > checking how to hardcode library paths into programs... immediate > checking whether stripping libraries is possible... 
yes > appending configuration tag "F77" to libtool > checking if libtool supports shared libraries... yes > checking whether to build shared libraries... yes > checking whether to build static libraries... yes > checking for g77 option to produce PIC... -fPIC > checking if g77 PIC flag -fPIC works... yes > checking if g77 supports -c -o file.o... yes > checking whether the g77 linker (/usr/bin/ld) supports shared libraries... yes > checking dynamic linker characteristics... GNU/Linux ld.so > checking how to hardcode library paths into programs... immediate > checking whether stripping libraries is possible... yes > checking for gcc... (cached) gcc > checking whether we are using the GNU C compiler... (cached) yes > checking whether gcc accepts -g... (cached) yes > checking for gcc option to accept ANSI C... (cached) none needed > checking dependency style of gcc... (cached) gcc3 > checking for sys_read_string in -libcommon... yes > checking for umad_init in -libumad... yes > checking for ANSI C header files... (cached) yes > checking netinet/in.h usability... yes > checking netinet/in.h presence... yes > checking for netinet/in.h... yes > checking for stdlib.h... (cached) yes > checking for string.h... (cached) yes > checking sys/time.h usability... yes > checking sys/time.h presence... yes > checking for sys/time.h... yes > checking for unistd.h... (cached) yes > checking infiniband/common.h usability... yes > checking infiniband/common.h presence... no > configure: WARNING: infiniband/common.h: accepted by the compiler, > rejected by the preprocessor! > configure: WARNING: infiniband/common.h: proceeding with the compiler's result > checking for infiniband/common.h... yes > checking infiniband/umad.h usability... yes > checking infiniband/umad.h presence... no > configure: WARNING: infiniband/umad.h: accepted by the compiler, > rejected by the preprocessor! 
> configure: WARNING: infiniband/umad.h: proceeding with the compiler's result > checking for infiniband/umad.h... yes > checking for memset... yes > checking for strrchr... yes > checking for strtol... yes > checking for an ANSI C-conforming const... yes > checking for inline... inline > checking whether ld accepts --version-script... yes > configure: creating ./config.status > config.status: creating Makefile > config.status: creating libibmad.spec > config.status: creating config.h > config.status: config.h is unchanged > config.status: executing depfiles commands > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Fri Feb 10 09:43:57 2006 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Feb 2006 09:43:57 -0800 Subject: [openib-general] IPoIB and lid change In-Reply-To: <1139587533.4450.6094.camel@hal.voltaire.com> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> Message-ID: <20060210174357.GD24865@esmail.cup.hp.com> On Fri, Feb 10, 2006 at 11:05:34AM -0500, Hal Rosenstock wrote: > > Hi, Roland! > > One issue we have with IPoIB is that IPoIB may cache a remote node path > > for a long time. Remote LID may get changed e.g. if the SM is changed, > > and IPoIB might lose connectivity. I wonder if this is why when I reload the IB drivers on one node I sometimes have to reload them on other nodes too. Otherwise ping over IPoIB doesn't work. > The remote LID may get changed for other reasons too without an SM > change (SM merge of 2 separate subnets). How can this be handled ? Isn't this just another case of the SM changing for one of the subnets? 
grant From halr at voltaire.com Fri Feb 10 09:52:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 12:52:53 -0500 Subject: [openib-general] IPoIB and lid change In-Reply-To: <20060210174357.GD24865@esmail.cup.hp.com> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060210174357.GD24865@esmail.cup.hp.com> Message-ID: <1139593973.4475.97.camel@hal.voltaire.com> On Fri, 2006-02-10 at 12:43, Grant Grundler wrote: > On Fri, Feb 10, 2006 at 11:05:34AM -0500, Hal Rosenstock wrote: > > > Hi, Roland! > > > One issue we have with IPoIB is that IPoIB may cache a remote node path > > > for a long time. Remote LID may get changed e.g. if the SM is changed, > > > and IPoIB might lose connectivity. > > I wonder if this is why when I reload the IB drivers on one node > I sometimes have to reload them on other nodes too. Otherwise > ping over IPoIB doesn't work. > > > The remote LID may get changed for other reasons too without an SM > > change (SM merge of 2 separate subnets). How can this be handled ? > > Isn't this just another case of the SM changing for one of the subnets? For the one whose SM loses but there still could be LID changes in the existing subnet depending on the SM policy. -- Hal > grant From sean.hefty at intel.com Fri Feb 10 10:02:21 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 10 Feb 2006 10:02:21 -0800 Subject: [openib-general] [PATCH] NOARP Device Support in the CMA In-Reply-To: <1139585760.18470.13.camel@trinity.ogc.int> Message-ID: Tom, Can you tell me if this alternate patch would work? I let the ib_addr module queue the callback. No CMA changes are necessary then. 
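
A rough model of that alternative, to make the control flow concrete: the address module always queues the request and returns 0, and a device that needs no ARP simply produces a request that is ready at once, so the caller sees the same callback path in both cases. Everything here (struct addr_req, resolve_ip, process_reqs, count_done) is a hypothetical stand-in written for this sketch, not the ib_addr code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request record; the names are illustrative only. */
struct addr_req {
	void (*callback)(int status, void *context);
	void *context;
	int status;
	int ready;                /* 1 if no further resolution is needed */
	struct addr_req *next;
};

static struct addr_req *req_list;   /* pending requests, newest first */

/* Queue a request; the caller always gets its answer via the callback. */
static int resolve_ip(struct addr_req *req, int dev_does_own_arp)
{
	req->status = 0;
	req->ready = dev_does_own_arp;  /* nothing to wait for on such a device */
	req->next = req_list;
	req_list = req;
	return 0;
}

/* Stand-in for the work handler: complete every request that is ready. */
static void process_reqs(void)
{
	struct addr_req **pp = &req_list;

	while (*pp) {
		struct addr_req *req = *pp;

		if (req->ready) {
			*pp = req->next;      /* unlink before calling back */
			req->callback(req->status, req->context);
		} else {
			pp = &req->next;      /* still waiting, keep it queued */
		}
	}
}

/* Example callback: counts successful completions via the context pointer. */
static void count_done(int status, void *context)
{
	if (status == 0)
		++*(int *)context;
}
```

The point of the design is that the caller never has to special-case an "already resolved" return value; the cost is one extra pass through the queue for devices that resolve addresses themselves.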
Signed-off-by: Sean Hefty

---

Index: addr.c
===================================================================
--- addr.c (revision 5295)
+++ addr.c (working copy)
@@ -36,6 +36,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -163,6 +164,12 @@ static int addr_resolve_remote(struct so
 	if (ret)
 		goto out;
 
+	/* If the device does ARP internally, return 'done' */
+	if (rt->idev->dev->flags & IFF_NOARP) {
+		copy_addr(addr, rt->idev->dev, NULL);
+		return 0;
+	}
+
 	neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev);
 	if (!neigh) {
 		ret = -ENODATA;

From mdidomenico at gmail.com Fri Feb 10 10:06:41 2006
From: mdidomenico at gmail.com (Michael Di Domenico)
Date: Fri, 10 Feb 2006 13:06:41 -0500
Subject: [openib-general] libtool error under libibmad
In-Reply-To: <1139591462.4475.29.camel@hal.voltaire.com>
References: <97a7c7ed0602100711r7d798df3i9bd1bfe77365d559@mail.gmail.com> <1139591462.4475.29.camel@hal.voltaire.com>
Message-ID: <97a7c7ed0602101006p68f953e0v91a2c628a06377d3@mail.gmail.com>

On 10 Feb 2006 12:11:03 -0500, Hal Rosenstock wrote:
> Hi Michael,
>
> On Fri, 2006-02-10 at 10:11, Michael Di Domenico wrote:
> > trying to compile libibmad, but not installing to /usr/local i'm
> > installing to /opt/openib.
libibcommon and libibumad compiled just > > fine with the same configure syntax > > > > (cd libibmad && ./autogen.sh && ./configure --prefix=/opt/openib > > LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include && make && make > > install) > > > > /bin/sh ./libtool --mode=link --tag=CC gcc -I/opt/openib/include > > -L/opt/openib/lib -o libibmad.la -rpath /opt/openib/lib -version-info > > 1 -export-dynamic -Wl,--version-script=./src/libibmad.map > > libibmad_la-dump.lo libibmad_la-fields.lo libibmad_la-mad.lo > > libibmad_la-portid.lo libibmad_la-resolve.lo libibmad_la-rpc.lo > > libibmad_la-sa.lo libibmad_la-smp.lo libibmad_la-gs.lo > > libibmad_la-serv.lo libibmad_la-register.lo libibmad_la-vendor.lo > > -libumad -libcommon > > grep: /usr/local/lib/libibcommon.la: No such file or directory > > /bin/sed: can't read /usr/local/lib/libibcommon.la: No such file or directory > > libtool: link: `/usr/local/lib/libibcommon.la' is not a valid libtool archive > > make[2]: *** [libibmad.la] Error 1 > > make[2]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > > make[1]: *** [all-recursive] Error 1 > > make[1]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > > make: *** [all] Error 2 > > > > I worked around the problem by linking /usr/local/lib to > > /opt/openib/lib temp. and it seemed to go through just fine... I'm > > not sure if this is a hardcode somewhere or if this particuliar > > library is picking up something strange that the others didnot > > It looks like libibcommon is not installed in /opt/openib/lib for some > reason. What is is that directory ? Also, what is in /opt/openib/include > ? Did you do make install the other libraries and in what order (and did > those work) ? Hal, It definately was installed in the directory. I'm not exactly sure what caused the problem. 
I ended up just removing the installation in /opt/openib and doing a make clean, and then remaking everything and it seemed to go through just fine, so perhaps there was an old .o file lying around. I managed to pull this from libtool --debug, but i'm not sure what caused it.

After the make clean, everything seemed to go in smoothly, so i'm not sure what happened.

++ echo X/opt/openib/lib/libibumad.la
++ /bin/sed -e '1s/^X//' -e 's%/[^/]*$%%'
+ ladir=/opt/openib/lib
+ test X/opt/openib/lib = X/opt/openib/lib/libibumad.la
+ dlname=
+ dlopen=
+ dlpreopen=
+ libdir=
+ library_names=
+ old_library=
+ installed=yes
+ shouldnotlink=no
+ case $lib in
+ . /opt/openib/lib/libibumad.la
++ dlname=libibumad.so.1
++ library_names='libibumad.so.1.0.0 libibumad.so.1 libibumad.so'
++ old_library=libibumad.a
++ dependency_libs=' -L/opt/openib/lib /usr/local/lib/libibcommon.la'
++ current=1
++ age=0
++ revision=0
++ installed=yes
++ shouldnotlink=no
++ dlopen=
++ dlpreopen=
++ libdir=/opt/openib/lib
+ test prog,scan = lib,link
+ test prog,scan = prog,scan
+ test -n ''
+ test -n ''
+ test scan = conv
+ linklib=
+ for l in '$old_library' '$library_names'

From halr at voltaire.com Fri Feb 10 10:17:02 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Feb 2006 13:17:02 -0500
Subject: [openib-general] libtool error under libibmad
In-Reply-To: <97a7c7ed0602101006p68f953e0v91a2c628a06377d3@mail.gmail.com>
References: <97a7c7ed0602100711r7d798df3i9bd1bfe77365d559@mail.gmail.com> <1139591462.4475.29.camel@hal.voltaire.com> <97a7c7ed0602101006p68f953e0v91a2c628a06377d3@mail.gmail.com>
Message-ID: <1139595307.4475.145.camel@hal.voltaire.com>

On Fri, 2006-02-10 at 13:06, Michael Di Domenico wrote:
> On 10 Feb 2006 12:11:03 -0500, Hal Rosenstock wrote:
> > Hi Michael,
> >
> > On Fri, 2006-02-10 at 10:11, Michael Di Domenico wrote:
> > > trying to compile libibmad, but not installing to /usr/local i'm
> > > installing to /opt/openib.
libibcommon and libibumad compiled just > > > fine with the same configure syntax > > > > > > (cd libibmad && ./autogen.sh && ./configure --prefix=/opt/openib > > > LDFLAGS=-L/opt/openib/lib CFLAGS=-I/opt/openib/include && make && make > > > install) > > > > > > /bin/sh ./libtool --mode=link --tag=CC gcc -I/opt/openib/include > > > -L/opt/openib/lib -o libibmad.la -rpath /opt/openib/lib -version-info > > > 1 -export-dynamic -Wl,--version-script=./src/libibmad.map > > > libibmad_la-dump.lo libibmad_la-fields.lo libibmad_la-mad.lo > > > libibmad_la-portid.lo libibmad_la-resolve.lo libibmad_la-rpc.lo > > > libibmad_la-sa.lo libibmad_la-smp.lo libibmad_la-gs.lo > > > libibmad_la-serv.lo libibmad_la-register.lo libibmad_la-vendor.lo > > > -libumad -libcommon > > > grep: /usr/local/lib/libibcommon.la: No such file or directory > > > /bin/sed: can't read /usr/local/lib/libibcommon.la: No such file or directory > > > libtool: link: `/usr/local/lib/libibcommon.la' is not a valid libtool archive > > > make[2]: *** [libibmad.la] Error 1 > > > make[2]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > > > make[1]: *** [all-recursive] Error 1 > > > make[1]: Leaving directory `/usr/src/trunk/src/userspace/management/libibmad' > > > make: *** [all] Error 2 > > > > > > I worked around the problem by linking /usr/local/lib to > > > /opt/openib/lib temp. and it seemed to go through just fine... I'm > > > not sure if this is a hardcode somewhere or if this particuliar > > > library is picking up something strange that the others didnot > > > > It looks like libibcommon is not installed in /opt/openib/lib for some > > reason. What is is that directory ? Also, what is in /opt/openib/include > > ? Did you do make install the other libraries and in what order (and did > > those work) ? > > Hal, > > It definately was installed in the directory. I'm not exactly sure > what caused the problem. 
I ended up just removing the installation in > /opt/openib and doing a make clean, and then remake'ing the everything > and it seemed to go through just fine, so perhaps there was an old .o > file lying around. I managed to pull this from libtool --debug, but > i'm not sure what caused it > > After the make clean, everything seemed to go in smoothly, so i'm not > sure what happened. That makes two of us :-( Some others may have experienced this as well and when they cleaned and rebuilt everything was OK. I'm not sure what causes this but haven't been able to recreate this. -- Hal > ++ echo X/opt/openib/lib/libibumad.la > ++ /bin/sed -e '1s/^X//' -e 's%/[^/]*$%%' > + ladir=/opt/openib/lib > + test X/opt/openib/lib = X/opt/openib/lib/libibumad.la > + dlname= > + dlopen= > + dlpreopen= > + libdir= > + library_names= > + old_library= > + installed=yes > + shouldnotlink=no > + case $lib in > + . /opt/openib/lib/libibumad.la > ++ dlname=libibumad.so.1 > ++ library_names='libibumad.so.1.0.0 libibumad.so.1 libibumad.so' > ++ old_library=libibumad.a > ++ dependency_libs=' -L/opt/openib/lib /usr/local/lib/libibcommon.la' > ++ current=1 > ++ age=0 > ++ revision=0 > ++ installed=yes > ++ shouldnotlink=no > ++ dlopen= > ++ dlpreopen= > ++ libdir=/opt/openib/lib > + test prog,scan = lib,link > + test prog,scan = prog,scan > + test -n '' > + test -n '' > + test scan = conv > + linklib= > + for l in '$old_library' '$library_names' > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From krause at cup.hp.com Fri Feb 10 10:17:53 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 10 Feb 2006 10:17:53 -0800 Subject: [openib-general] IPoIB and lid change In-Reply-To: <20060210174357.GD24865@esmail.cup.hp.com> References: <20060208201404.GE32759@mellanox.co.il> 
 <1139587533.4450.6094.camel@hal.voltaire.com> <20060210174357.GD24865@esmail.cup.hp.com>
Message-ID: <6.2.0.14.2.20060210101401.026fc8b0@esmail.cup.hp.com>

At 09:43 AM 2/10/2006, Grant Grundler wrote:
>On Fri, Feb 10, 2006 at 11:05:34AM -0500, Hal Rosenstock wrote:
> > > Hi, Roland!
> > > One issue we have with IPoIB is that IPoIB may cache a remote node path
> > > for a long time. Remote LID may get changed e.g. if the SM is changed,
> > > and IPoIB might lose connectivity.
>
>I wonder if this is why when I reload the IB drivers on one node
>I sometimes have to reload them on other nodes too. Otherwise
>ping over IPoIB doesn't work.

If endnodes are not periodically refreshing their caches or are not subscribing to event management to be informed a refresh is in order, then endnodes will fall out of sync and would need to be restarted to establish communication. This is a classic problem that was illustrated in various early router protocols and is why today's protocols implement a two-prong approach in many cases - limited cache lifetime and proactive cache event updates.

> > The remote LID may get changed for other reasons too without an SM
> > change (SM merge of 2 separate subnets). How can this be handled ?
>
>Isn't this just another case of the SM changing for one of the subnets?

An SM merge that involves updating LIDs is a non-trivial event. It requires connections to be effectively restarted as one cannot ascertain whether all packets are flushed from the fabric otherwise - that can cause silent data corruption. For a subsystem such as IPoverIB, a LID update should result in an unsolicited ARP / ND exchange which will cause all remote endnodes to receive the new information.

Mike

From halr at voltaire.com Fri Feb 10 10:56:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Feb 2006 13:56:47 -0500
Subject: [openib-general] port num in port priv
In-Reply-To:
References: <200602091540.k19FewKX012081@mail.baymicrosystems.com>
Message-ID: <1139597807.4475.178.camel@hal.voltaire.com>

On Thu, 2006-02-09 at 10:55, Roland Dreier wrote:
> The current code probably needs to be extended to give
> the upper layers the port number where a MAD was received, and also to
> make it possible to specify the port number that a directed route MAD
> will be sent on.

On the receive side, the port_num is part of the WC for DR SMPs. On the send side, the port_num is part of the WR (for DR SMPs). So at least that part is covered as far as I can tell. There are some changes I can see and of course, the SMI has not been exercised for switch forwarding so there could be some issues there.

-- Hal

From ftillier at silverstorm.com Fri Feb 10 11:12:33 2006
From: ftillier at silverstorm.com (Fab Tillier)
Date: Fri, 10 Feb 2006 11:12:33 -0800
Subject: [openib-general] RE: [Openib-windows] NFS performance and general disk network export advice (Linux-Windows)
In-Reply-To: <1139544966.8779.7.camel@strider.opengridcomputing.com>
Message-ID: <000d01c62e75$f6511c80$6701a8c0@infiniconsys.com>

Hi Tom,

> Fab:
>
> As you point out, we've been focused on the main trunk and our target
> test platforms are Linux based. That said, we actually had Beta versions
> of NDIS and Winsock Direct drivers for the AMSO adapter, so we know this
> works and we know where the dead are buried.
>
> It probably makes sense to wait until the core iWARP support is merged
> into the main trunk. However, when/if you decide to merge up from the
> main trunk and pick up iWARP support, I am more than happy to help you
> with any issues that you may have.
I think you misunderstood what I meant - I think it makes sense to have functionality like the CMA in Windows as that provides applications with a valuable service (using IP addressing to establish IB connections). I don't have any plans to merge in iWarp support though, as I don't understand what the iWarp community's requirements are and haven't followed the discussions very closely. There is no policy or plan for merging code from Linux to Windows. Lastly, with the Microsoft RDMA and TCP chimney designs, I don't know if it makes sense for iWarp to layer into the IB framework rather than into the OS provided one. If there is to be iWarp support in OpenIB Windows, I expect iWarp vendors to drive that effort. Hopefully this clears things up. - Fab From arlin.r.davis at intel.com Fri Feb 10 11:11:03 2006 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Fri, 10 Feb 2006 11:11:03 -0800 Subject: [dat-discussions] [openib-general] [RFC] DAT2.0 immediate data proposal Message-ID: <59278FC0C48A994BABABD069571E45680DED82CD@orsmsx401.amr.corp.intel.com> > Arlin, > This can be done. > > But I have an issue that extension call violate Transport Requirement. > Currently, the matching semantic is well-defined since > Recv only matches Send. Since Spec does not have any idea what > operations are defined in extension(s) there is a problem > with the transport requirements. We can, of course, > make some generic statement that with does not cover APIs > that are defined in extensions. This is a good point. I think a generic statement will suffice. By definition we are extending transport services so both requirements/interfaces could change. Any variation will be documented with the extended code definitions. Not sure what else to do since specific transport requirements will differ from DAT requirements in most cases. 
-arlin From halr at voltaire.com Fri Feb 10 13:46:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 16:46:06 -0500 Subject: [openib-general] port num in port priv In-Reply-To: <1139597807.4475.178.camel@hal.voltaire.com> References: <200602091540.k19FewKX012081@mail.baymicrosystems.com> <1139597807.4475.178.camel@hal.voltaire.com> Message-ID: <1139607965.4475.364.camel@hal.voltaire.com> On Fri, 2006-02-10 at 13:56, Hal Rosenstock wrote: > On Thu, 2006-02-09 at 10:55, Roland Dreier wrote: > > The current code probably needs to be extended to give > > the upper layers the port number where a MAD was received, and also to > > make it possible to specify the port number that a directed route MAD > > will be sent on. > > On the receive side, the port_num is part of the WC for DR SMPs. > On the send side, the port_num is part of the WR (for DR SMPs). > So at least that part is covered as far as I can tell. While this is supported, one issue is that in order to obtain the port_num on the send side, the optional ib_query_ah must be implemented as the ib_mad_send_buf only contains an ah and not a port_num. Is there a better way to handle this ?
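To illustrate the send-side issue described above, here is a minimal userspace sketch. It assumes mock types: mock_device, mock_ah, mock_send_buf and mock_send_port_num are hypothetical names made up for this example, not the kernel's ib_verbs API. It only shows why depending on an optional query hook is awkward: when a driver leaves the hook unset, the port number cannot be recovered from the address handle at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* All types below are hypothetical stand-ins, not the real ib_verbs structures. */
struct mock_ah_attr { unsigned char port_num; };
struct mock_ah;

struct mock_device {
    /* Optional hook, modeled on the optional query verb; may be NULL. */
    int (*query_ah)(struct mock_ah *ah, struct mock_ah_attr *attr);
};

struct mock_ah {
    struct mock_device *device;
    unsigned char hidden_port;   /* known only to the "driver" */
};

/* Like the send buffer in the discussion: it carries an AH, not a port. */
struct mock_send_buf { struct mock_ah *ah; };

/* A driver that does implement the optional query. */
int mock_query_ah(struct mock_ah *ah, struct mock_ah_attr *attr)
{
    attr->port_num = ah->hidden_port;
    return 0;
}

/* Recover the send port from the AH; fails if the hook is missing. */
int mock_send_port_num(const struct mock_send_buf *buf, unsigned char *port_num)
{
    struct mock_ah_attr attr;
    int ret;

    assert(buf && buf->ah && buf->ah->device && port_num);

    if (!buf->ah->device->query_ah)
        return -ENOSYS;          /* optional verb not provided */

    ret = buf->ah->device->query_ah(buf->ah, &attr);
    if (ret)
        return ret;

    *port_num = attr.port_num;
    return 0;
}
```

With a device whose query_ah hook is set, mock_send_port_num() returns 0 and fills in the port; with the hook left NULL it returns -ENOSYS, which is the case the question above is worried about.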
-- Hal > There are some changes I can see and of course, the SMI has not been > exercised for switch forwarding so there could be some issues there. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Feb 10 14:08:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Feb 2006 17:08:21 -0500 Subject: [openib-general] 2.6.9-22 kernel patch for iSER In-Reply-To: <43EC1CEC.8030808@gs-lab.com> References: <43EC1CEC.8030808@gs-lab.com> Message-ID: <1139609199.4475.408.camel@hal.voltaire.com> On Thu, 2006-02-09 at 23:56, Karun Beer Sharma wrote: > Hi: > > I want to compile openIB with 2.6.9-22EL kernel version. It is giving me > error with respect to iSER module. > After browsing through openib.org website, I came to know that iSER is > supported from 2.6.11 onwards but some work is going on for the patch > with 2.6.9-22 kernel version. There is a work in progress for 2.6.9-11 to backport iser based on svn 5164 done by Bob Woodruff and is in gen2/branches/backport-to-2.6.9 The readme states: Note that in version 5164, iSer and the associated iscsi transport, and the iscsi tcp initiator from 2.6.15 has been backported and included. However, there are still known problems with iSer that are being debugged, so treat the iSer module as very experimental for now. -- Hal > Just want to know if the patch is available. If yes, from where can i > download it. > > Thanks in advance. 
> > Regards, > Karun > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Fri Feb 10 14:36:27 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Feb 2006 16:36:27 -0600 Subject: [openib-general] [PATCH] NOARP Device Support in the CMA In-Reply-To: References: Message-ID: <1139610987.18470.50.camel@trinity.ogc.int> Sean: Sorry for the delay ... I think what would happen in that case is that the ARP request would get queued, but never resolved because the ARP reply would never hit the host (it's eaten by the adapter itself). So eventually the request would timeout, fail to find a resolved next hop, and end up delivering an error to the CMA via the callback. On Fri, 2006-02-10 at 10:02 -0800, Sean Hefty wrote: > Tom, > > Can you tell me if this alternate patch would work? I let the ib_addr module > queue the callback. No CMA changes are necessary then. 
> > Signed-off-by: Sean Hefty > > --- > > Index: addr.c > =================================================================== > --- addr.c (revision 5295) > +++ addr.c (working copy) > @@ -36,6 +36,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -163,6 +164,12 @@ static int addr_resolve_remote(struct so > if (ret) > goto out; > > + /* If the device does ARP internally, return 'done' */ > + if (rt->idev->dev->flags & IFF_NOARP) { > + copy_addr(addr, rt->idev->dev, NULL); > + return 0; > + } > + > neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev); > if (!neigh) { > ret = -ENODATA; > > From sean.hefty at intel.com Fri Feb 10 14:44:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 10 Feb 2006 14:44:47 -0800 Subject: [openib-general] [PATCH] NOARP Device Support in the CMA In-Reply-To: <1139610987.18470.50.camel@trinity.ogc.int> Message-ID: >I think what would happen in that case is that the ARP request would get >queued, but never resolved because the ARP reply would never hit the >host (it's eaten by the adapter itself). Returning success (0) from addr_resolve_remote() should mark the request as done and queue it for a callback. This is similar to the case where the ARP data is already cached. In the callback, process_req(), the request will have status == 0, resulting in a callback to the user. - Sean From sean.hefty at intel.com Fri Feb 10 15:12:08 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 10 Feb 2006 15:12:08 -0800 Subject: [openib-general] [PATCH v3] NOARP Device Support in the CMA In-Reply-To: Message-ID: This is yet another updated patch that releases the reference on the acquired route. 
Signed-off-by: Sean Hefty --- Index: addr.c =================================================================== --- addr.c (revision 5295) +++ addr.c (working copy) @@ -36,6 +36,7 @@ #include #include +#include #include #include #include @@ -163,15 +164,21 @@ static int addr_resolve_remote(struct so if (ret) goto out; + /* If the device does ARP internally, return 'done' */ + if (rt->idev->dev->flags & IFF_NOARP) { + copy_addr(addr, rt->idev->dev, NULL); + goto put; + } + neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev); if (!neigh) { ret = -ENODATA; - goto err1; + goto put; } if (!(neigh->nud_state & NUD_VALID)) { ret = -ENODATA; - goto err2; + goto release; } if (!src_ip) { @@ -180,9 +187,9 @@ static int addr_resolve_remote(struct so } ret = copy_addr(addr, neigh->dev, neigh->ha); -err2: +release: neigh_release(neigh); -err1: +put: ip_rt_put(rt); out: return ret; From tom at opengridcomputing.com Fri Feb 10 15:38:54 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Feb 2006 17:38:54 -0600 Subject: [openib-general] [PATCH] NOARP Device Support in the CMA In-Reply-To: References: Message-ID: <1139614734.18470.53.camel@trinity.ogc.int> I'll try it and see what happens. I haven't tested your approach in the latest tree, so it may work as you describe. Thanks, Tom On Fri, 2006-02-10 at 14:44 -0800, Sean Hefty wrote: > >I think what would happen in that case is that the ARP request would get > >queued, but never resolved because the ARP reply would never hit the > >host (it's eaten by the adapter itself). > > Returning success (0) from addr_resolve_remote() should mark the request as done > and queue it for a callback. This is similar to the case where the ARP data is > already cached. In the callback, process_req(), the request will have status == > 0, resulting in a callback to the user. 
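The control flow of the v3 patch above (return success early for a NOARP device, and unwind references through goto labels) can be sketched in userspace C. This is only an illustration under stated assumptions: every mock_* type, the MOCK_* flags and mock_resolve_remote() are hypothetical stand-ins, and the reference counters merely imitate what ip_rt_put() and neigh_release() balance in the real code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel flags and structures. */
#define MOCK_IFF_NOARP 0x1u
#define MOCK_NUD_VALID 0x1u
#define MOCK_HA_LEN    6

struct mock_dev   { unsigned flags; };
struct mock_route { struct mock_dev *dev; int refs; };
struct mock_neigh { unsigned nud_state; unsigned char ha[MOCK_HA_LEN]; int refs; };
struct mock_addr  { unsigned char dst_ha[MOCK_HA_LEN]; int have_dst; };

/* Copy the destination hardware address, or clear it when ha is NULL. */
static void mock_copy_addr(struct mock_addr *addr, const unsigned char *ha)
{
    if (ha) {
        memcpy(addr->dst_ha, ha, MOCK_HA_LEN);
        addr->have_dst = 1;
    } else {
        memset(addr->dst_ha, 0, MOCK_HA_LEN);
        addr->have_dst = 0;
    }
}

/* neigh == NULL models a failed neighbour lookup. */
int mock_resolve_remote(struct mock_route *rt, struct mock_neigh *neigh,
                        struct mock_addr *addr)
{
    int ret = 0;

    assert(rt && rt->dev && addr);
    rt->refs++;                       /* route lookup takes a reference */

    /* Device resolves addresses itself: report "done" right away. */
    if (rt->dev->flags & MOCK_IFF_NOARP) {
        mock_copy_addr(addr, NULL);
        goto put;
    }

    if (!neigh) {
        ret = -ENODATA;
        goto put;
    }
    neigh->refs++;                    /* neighbour lookup takes a reference */

    if (!(neigh->nud_state & MOCK_NUD_VALID)) {
        ret = -ENODATA;
        goto release;
    }

    mock_copy_addr(addr, neigh->ha);
release:
    neigh->refs--;                    /* counterpart of neigh_release() */
put:
    rt->refs--;                       /* counterpart of ip_rt_put() */
    return ret;
}
```

Whichever path is taken, both counters return to their starting values, which is the property the renamed labels (put, release) are meant to make obvious.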
> > - Sean > From tom at opengridcomputing.com Fri Feb 10 15:45:16 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Feb 2006 17:45:16 -0600 Subject: [openib-general] [PATCH] NOARP Device Support in the CMA In-Reply-To: <1139614734.18470.53.camel@trinity.ogc.int> References: <1139614734.18470.53.camel@trinity.ogc.int> Message-ID: <1139615116.18470.57.camel@trinity.ogc.int> Sean: Good catch. You're right. I was looking in the wrong place. It is processed as you suggest in process_req(...). We don't need the change to the CMA, only the mod in addr_resolve_remote(...). On Fri, 2006-02-10 at 17:38 -0600, Tom Tucker wrote: > I'll try it and see what happens. I haven't tested your approach in the > latest tree, so it may work as you describe. > > Thanks, > > Tom > > On Fri, 2006-02-10 at 14:44 -0800, Sean Hefty wrote: > > >I think what would happen in that case is that the ARP request would get > > >queued, but never resolved because the ARP reply would never hit the > > >host (it's eaten by the adapter itself). > > > > Returning success (0) from addr_resolve_remote() should mark the request as done > > and queue it for a callback. This is similar to the case where the ARP data is > > already cached. In the callback, process_req(), the request will have status == > > 0, resulting in a callback to the user. > > > > - Sean > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From arlin.r.davis at intel.com Fri Feb 10 15:56:00 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 10 Feb 2006 15:56:00 -0800 Subject: [openib-general] [PATCH] [uDAPL] DAT 2.0 extensions with IB immed data and atomics examples Message-ID: James, Here is a patch with the IB rdma_write_with_immediate() back in as an extension along with the atomics. Please review the various methods that I used to extend the return codes, memory privileges, and DTO event processing. Thanks, -arlin Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 5366) +++ test/dtest/makefile (working copy) @@ -4,13 +4,18 @@ CFLAGS = -O2 -g DAT_INC = ../../dat/include DAT_LIB = /usr/local/lib -all: dtest +all: dtest dtest_ext clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest_ext dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat +dtest_ext: ./dtest_ext.c + $(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \ + -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat + Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 5366) +++ dapl/include/dapl.h (working copy) @@ -1,25 +1,28 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc.
All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
@@ -178,6 +181,11 @@ typedef enum dapl_qp_state #define DAT_ERROR(Type, SubType) ((DAT_RETURN)(DAT_CLASS_ERROR | Type | SubType)) +#ifdef DAT_EXTENSIONS +#define DAT_ERROR_EXTENSION(Type, ExtType, SubType) \ + ((DAT_RETURN)(DAT_CLASS_ERROR | Type | ExtType | SubType)) +#endif + /********************************************************************* * * * Typedefs * @@ -563,6 +571,10 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAT_EXTENSIONS + DAPL_DTO_TYPE_EXTENSION +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +582,7 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, + } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -578,6 +591,9 @@ struct dapl_dto_cookie DAPL_DTO_TYPE type; DAT_DTO_COOKIE cookie; DAT_COUNT size; /* used for SEND and RDMA write */ +#ifdef DAT_EXTENSIONS + DAT_PVOID extension; /* extended DTO ops */ +#endif }; /* DAPL_RMR_COOKIE used as context for bind WQEs */ @@ -1116,6 +1132,14 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_DTO_EXTENSION_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/include/dapl_debug.h =================================================================== --- dapl/include/dapl_debug.h (revision 5366) +++ dapl/include/dapl_debug.h (working copy) @@ -112,7 +112,13 @@ extern void dapl_internal_dbg_log ( DAPL #define DCNT_EVD_DEQUEUE_NOT_FOUND 18 #define DCNT_TIMER_SET 19 #define DCNT_TIMER_CANCEL 20 +#ifdef DAT_EXTENSION +#define DCNT_EXTENSION 21 +#define DCNT_NUM_COUNTERS 22 +#else #define DCNT_NUM_COUNTERS 21 +#endif + #define DCNT_ALL_COUNTERS DCNT_NUM_COUNTERS #if defined(DAPL_COUNTERS) Index: dapl/udapl/Makefile 
=================================================================== --- dapl/udapl/Makefile (revision 5366) +++ dapl/udapl/Makefile (working copy) @@ -80,6 +80,12 @@ ifdef OS_VENDOR CFLAGS += -D$(OS_VENDOR) endif +# If an implementation supports immdiate data and extensions +CFLAGS += -DDAT_EXTENSIONS + +# If an implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + # # dummy provider # @@ -283,6 +289,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 5366) +++ dapl/common/dapl_ia_query.c (working copy) @@ -151,6 +151,7 @@ dapl_ia_query ( provider_attr->dat_qos_supported = DAT_QOS_BEST_EFFORT; provider_attr->completion_flags_supported = DAT_COMPLETION_DEFAULT_FLAG; provider_attr->is_thread_safe = DAT_FALSE; + /* * N.B. The second part of the following equation will evaluate * to 0 unless IBHOSTS_NAMING is enabled. @@ -167,6 +168,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Query for provider specific attributes + */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_query_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. 
Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 5366) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_query_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAT_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 5366) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,12 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAT_EXTENSIONS + /* dat-2.0 */ + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 5366) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,21 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMMED", + "OP_SEND", + "OP_SEND_IMMED", + "OP_RDMA_READ", + "OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMMED", + "OP_RECEIVE_RDMA_IMMED", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +524,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1047,6 +1063,10 @@ dapli_evd_cqe_to_event ( 
event_ptr->event_data.dto_completion_event_data.user_cookie = cookie->val.dto.cookie; event_ptr->event_data.dto_completion_event_data.status = dto_status; + + /* new operation field for DAT 2.0 */ + event_ptr->event_data.dto_completion_event_data.operation = + DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr); #ifdef DAPL_DBG if (dto_status == DAT_DTO_SUCCESS) @@ -1055,15 +1075,20 @@ dapli_evd_cqe_to_event ( ibtype = DAPL_GET_CQE_OPTYPE (cqe_ptr); - dapl_os_assert ((ibtype == OP_SEND && - cookie->val.dto.type == DAPL_DTO_TYPE_SEND) - || (ibtype == OP_RECEIVE && - cookie->val.dto.type == DAPL_DTO_TYPE_RECV) - || (ibtype == OP_RDMA_WRITE && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE) - || (ibtype == OP_RDMA_READ && - cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_READ)); - } +#ifdef DAT_EXTENSIONS + if ((cookie->val.dto.type != DAPL_DTO_TYPE_EXTENSION) && + (ibtype == OP_RECEIVE && + cookie->val.dto.type == DAPL_DTO_TYPE_RECV)) +#endif + dapl_os_assert ((ibtype == OP_SEND && + cookie->val.dto.type == DAPL_DTO_TYPE_SEND) + || (ibtype == OP_RECEIVE && + cookie->val.dto.type == DAPL_DTO_TYPE_RECV) + || (ibtype == OP_RDMA_WRITE && + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_WRITE) + || (ibtype == OP_RDMA_READ && + cookie->val.dto.type == DAPL_DTO_TYPE_RDMA_READ)); + } #endif /* DAPL_DBG */ if ( cookie->val.dto.type == DAPL_DTO_TYPE_SEND || @@ -1079,6 +1104,11 @@ dapli_evd_cqe_to_event ( DAPL_GET_CQE_BYTESNUM (cqe_ptr); } +#ifdef DAT_EXTENSIONS + if ( DAPL_GET_CQE_DTOS_OPTYPE(cqe_ptr) == DAT_EXTENSION ) + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); +#endif + dapls_cookie_dealloc (buffer, cookie); break; } @@ -1113,6 +1143,7 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/common/dapl_debug.c =================================================================== --- dapl/common/dapl_debug.c (revision 5366) +++ dapl/common/dapl_debug.c (working copy) @@ -86,6 
+86,9 @@ char *dapl_dbg_counter_names[] = { "dapl_evd_not_found", "dapls_timer_set", "dapls_timer_cancel", +#ifdef DAT_EXTENSION + "dapls_extension", +#endif }; void dapl_dump_cntr( int cntr ) Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 5366) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - DTO operations and CQE macros + * The OpenIB uCMA provider - DTO operations and CQE macros * **************************************************************************** * Source Control System Information @@ -119,7 +119,6 @@ dapls_ib_post_recv ( return DAT_SUCCESS; } - /* * dapls_ib_post_send * @@ -191,7 +190,7 @@ dapls_ib_post_send ( if (cookie != NULL) cookie->val.dto.size = total_len; - + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { wr.wr.rdma.remote_addr = remote_iov->target_address; wr.wr.rdma.rkey = remote_iov->rmr_context; @@ -200,6 +199,7 @@ dapls_ib_post_send ( wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); } + /* inline data for send or write ops */ if ((total_len <= ibt_ptr->max_inline_send) && ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) @@ -224,6 +224,178 @@ dapls_ib_post_send ( return DAT_SUCCESS; } +/* map Work Completions to DAPL WR operations */ +STATIC _INLINE_ DAT_DTOS dapls_cqe_dtos_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + + case IBV_WC_SEND: + return (DAT_SEND); + case IBV_WC_RDMA_READ: + return (DAT_RDMA_READ); + case IBV_WC_BIND_MW: + return (DAT_BIND_MW); +#ifdef DAT_EXTENSIONS + case IBV_WC_RDMA_WRITE: + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (DAT_EXTENSION); + else + return (DAT_RDMA_WRITE); + case IBV_WC_COMP_SWAP: + return (DAT_EXTENSION); + case IBV_WC_FETCH_ADD: + return (DAT_EXTENSION); + case IBV_WC_RECV_RDMA_WITH_IMM: + return (DAT_EXTENSION); +#else + case IBV_WC_RDMA_WRITE: + return (DAT_RDMA_WRITE); +#endif + case 
IBV_WC_RECV: + return (DAT_RECEIVE); + default: + return (0xff); + } +} +#define DAPL_GET_CQE_DTOS_OPTYPE(cqe_p) dapls_cqe_dtos_opcode(cqe_p) + + +#ifdef DAT_EXTENSIONS +/* + * dapls_ib_post_ext_send + * + * Provider specific extended Post SEND function for atomics + * OP_COMP_AND_SWAP and OP_FETCH_AND_ADD + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_ext_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 immed_data, + IN DAT_UINT64 compare_add, + IN DAT_UINT64 swap, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs " + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; +
ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + switch (op_type) { + case OP_RDMA_WRITE_IMMED: + /* OP_RDMA_WRITE)IMMED has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: rkey 0x%x va %#016Lx immed=0x%x\n", + remote_iov->rmr_context, + remote_iov->target_address, immed_data); + + wr.imm_data = immed_data; + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + break; + case OP_COMP_AND_SWAP: + /* OP_COMP_AND_SWAP has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: OP_COMP_AND_SWAP=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, swap, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.swap = swap; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + case OP_FETCH_AND_ADD: + /* OP_FETCH_AND_ADD has direct IB wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext: OP_FETCH_AND_ADD=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + default: + break; + } + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? 
IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} +#endif + STATIC _INLINE_ DAT_RETURN dapls_ib_optional_prv_dat( IN DAPL_CR *cr_ptr, @@ -233,13 +405,17 @@ dapls_ib_optional_prv_dat( return DAT_SUCCESS; } +/* map Work Completions to DAPL WR operations */ STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) { switch (cqe_p->opcode) { case IBV_WC_SEND: return (OP_SEND); case IBV_WC_RDMA_WRITE: - return (OP_RDMA_WRITE); + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (OP_RDMA_WRITE_IMMED); + else + return (OP_RDMA_WRITE); case IBV_WC_RDMA_READ: return (OP_RDMA_READ); case IBV_WC_COMP_SWAP: @@ -249,14 +425,18 @@ STATIC _INLINE_ int dapls_cqe_opcode(ib_ case IBV_WC_BIND_MW: return (OP_BIND_MW); case IBV_WC_RECV: - return (OP_RECEIVE); + if (cqe_p->wc_flags & IBV_WC_WITH_IMM) + return (OP_RECEIVE_IMMED); + else + return (OP_RECEIVE); case IBV_WC_RECV_RDMA_WITH_IMM: - return (OP_RECEIVE_IMM); + return (OP_RECEIVE_RDMA_IMMED); default: return (OP_INVALID); } } + #define DAPL_GET_CQE_OPTYPE(cqe_p) dapls_cqe_opcode(cqe_p) #define DAPL_GET_CQE_WRID(cqe_p) ((ib_work_completion_t*)cqe_p)->wr_id #define DAPL_GET_CQE_STATUS(cqe_p) ((ib_work_completion_t*)cqe_p)->status Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 5366) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - init, open, close, utilities, work thread + * The OpenIB uCMA provider - init, open, close, utilities, work thread * **************************************************************************** * Source Control System 
Information @@ -64,7 +64,6 @@ static const char rcsid[] = "$Id: $"; #include /* for struct ifreq */ #include /* for ARPHRD_INFINIBAND */ - int g_dapl_loopback_connection = 0; int g_ib_pipe[2]; ib_thread_state_t g_ib_thread_state = 0; @@ -121,11 +120,12 @@ static int getipaddr(char *name, char *a if (getaddrinfo(name, NULL, NULL, &res)) { /* retry using network device name */ ret = getipaddr_netdev(name,addr,len); - if (ret) + if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_WARN, " getipaddr: invalid name, addr, or netdev(%s)\n", name); - return ret; + return ret; + } } else { if (len >= res->ai_addrlen) memcpy(addr, res->ai_addr, res->ai_addrlen); @@ -330,6 +330,13 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_RUN) { + dapl_os_unlock(&g_hca_lock); + goto bail; + } + dapl_os_unlock(&g_hca_lock); + /* * Remove hca from async and CQ event processing list * Wakeup work thread to remove from polling list @@ -342,10 +349,12 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ + write(g_ib_pipe[1], "w", sizeof "w"); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, &remain); } +bail: return (DAT_SUCCESS); } @@ -727,7 +736,7 @@ void dapli_thread(void *arg) int ret,idx,fds; char rbuf[2]; - dapl_dbg_log (DAPL_DBG_TYPE_CM, + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n", getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd()); @@ -767,7 +776,7 @@ void dapli_thread(void *arg) ufds[idx].revents = 0; uhca[idx] = hca; - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d" " pipe=%d cm=%d cq=d\n", getpid(), hca, ufds[idx-1].fd, @@ -783,14 +792,14 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); ret = poll(ufds, fds, -1); if (ret <= 0) { - 
dapl_dbg_log(DAPL_DBG_TYPE_WARN, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d): ERR %s poll\n", getpid(),strerror(errno)); dapl_os_lock(&g_hca_lock); continue; } - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_event: " " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n", getpid(), ufds[idx-1].revents, ufds[0].revents, @@ -834,3 +843,61 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +/* + * dapls_query_provider_specific_attr + * + * Input: + * attr_ptr Pointer provider attributes + * + * Output: + * none + * + * Returns: + * void + */ +DAT_NAMED_ATTR ib_attrs[] = { + +#ifdef DAT_EXTENSIONS + { + DAT_EXTENSION_ATTR, + DAT_EXTENSION_ATTR_TRUE + }, + { + DAT_EXTENSION_ATTR_VERSION, + DAT_EXTENSION_ATTR_VERSION_VALUE + }, + { + DAT_EXTENSION_ATTR_FETCH_AND_ADD, + DAT_EXTENSION_ATTR_TRUE + }, + { + DAT_EXTENSION_ATTR_CMP_AND_SWAP, + DAT_EXTENSION_ATTR_TRUE + }, + { + DAT_EXTENSION_ATTR_IMMED_DATA, + DAT_EXTENSION_ATTR_TRUE + }, +#else + { + "DAT_EXTENSION_INTERFACE", + "FALSE" + }, +#endif +}; + +#define SPEC_ATTR_SIZE(x) ( sizeof(x)/sizeof(DAT_NAMED_ATTR) ) + +/* + * Query for all provider specific attributes and + */ +void dapls_query_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 5366) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_mem.c + * MODULE: dapl_ib_mem.c * - * PURPOSE: Intel DET APIs: Memory windows, registration, + * PURPOSE: OpenIB uCMA provider Memory windows, registration, * and protection domain * * $Id: $ @@ -72,12 +72,12 @@ dapls_convert_privileges(IN DAT_MEM_PRIV 
access |= IBV_ACCESS_LOCAL_WRITE; if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) access |= IBV_ACCESS_REMOTE_WRITE; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) access |= IBV_ACCESS_REMOTE_READ; +#ifdef DAT_EXTENSIONS + if (DAT_MEM_PRIV_EXT_REMOTE_ATOMIC & privileges) + access |= IBV_ACCESS_REMOTE_ATOMIC; +#endif return access; } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 5366) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_qp.c + * MODULE: dapl_ib_qp.c * - * PURPOSE: QP routines for access to DET Verbs + * PURPOSE: OpenIB uCMA QP routines * * $Id: $ **********************************************************************/ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 5366) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - definitions, prototypes, + * The OpenIB uCMA provider - definitions, prototypes, * **************************************************************************** * Source Control System Information @@ -123,15 +123,16 @@ typedef struct ibv_comp_channel *ib_wait /* DTO OPs, ordered for DAPL ENUM definitions */ #define OP_RDMA_WRITE IBV_WR_RDMA_WRITE -#define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define OP_RDMA_WRITE_IMMED IBV_WR_RDMA_WRITE_WITH_IMM #define OP_SEND IBV_WR_SEND -#define OP_SEND_IMM IBV_WR_SEND_WITH_IMM +#define OP_SEND_IMMED IBV_WR_SEND_WITH_IMM #define OP_RDMA_READ IBV_WR_RDMA_READ #define OP_COMP_AND_SWAP 
IBV_WR_ATOMIC_CMP_AND_SWP #define OP_FETCH_AND_ADD IBV_WR_ATOMIC_FETCH_AND_ADD -#define OP_RECEIVE 7 /* internal op */ -#define OP_RECEIVE_IMM 8 /* internel op */ -#define OP_BIND_MW 9 /* internal op */ +#define OP_RECEIVE 0x7 /* internal op */ +#define OP_RECEIVE_IMMED 0x8 /* internel op */ +#define OP_RECEIVE_RDMA_IMMED 0x9 /* internal op */ +#define OP_BIND_MW 0xa /* internal op */ #define OP_INVALID 0xff /* Definitions to map QP state */ @@ -295,7 +296,8 @@ dapl_convert_errno( IN int err, IN const if (!err) return DAT_SUCCESS; #if DAPL_DBG - if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + if ((err != EAGAIN) && (err != ETIME) && + (err != ETIMEDOUT) && (err != EINTR)) dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); #endif Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 5366) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - completion queue + * The OpenIB uCMA provider - completion queue * **************************************************************************** * Source Control System Information @@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; - status = poll(&cq_fd, 1, timeout_ms); + /* restart syscall */ + while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 ) + if (errno == EINTR) + continue; /* returned event */ if (status > 0) { @@ -511,6 +514,8 @@ dapls_ib_wait_object_wait(IN ib_wait_obj /* timeout */ } else if (status == 0) status = ETIMEDOUT; + else + status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", Index: dat/include/dat/udat.h =================================================================== --- dat/include/dat/udat.h (revision 5366) +++ dat/include/dat/udat.h (working copy) @@ -1,31 +1,51 @@ /* - * Copyright (c) 
2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2004, Network Appliance, Inc. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0". The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. The license is also available from - * the Open Source Initiative, see + * + * OR + * + * 2) under the terms of the "The BSD License". The license is also available + * from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The - * license is also available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. 
+ * * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation + * notice, either one of the license notices in the documentation * and/or other materials provided with the distribution. + * + * Neither the name of Network Appliance, Inc. nor the names of other DAT + * Collaborative contributors may be used to endorse or promote + * products derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND + * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. */ /**************************************************************** Index: dat/include/dat/dat_redirection.h =================================================================== --- dat/include/dat/dat_redirection.h (revision 5366) +++ dat/include/dat/dat_redirection.h (working copy) @@ -59,10 +59,10 @@ typedef struct dat_provider DAT_PROVIDER * This would allow a good compiler to avoid indirection overhead when * making function calls. 
*/ - #define DAT_HANDLE_TO_PROVIDER(handle) (*(DAT_PROVIDER **)(handle)) #endif + #define DAT_IA_QUERY(ia, evd, ia_msk, ia_ptr, p_msk, p_ptr) \ (*DAT_HANDLE_TO_PROVIDER (ia)->ia_query_func) (\ (ia), \ @@ -395,6 +395,14 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#ifdef DAT_EXTENSIONS +#define DAT_EXTENSION(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->extension_func) (\ + (handle), \ + (op), \ + (args)) +#endif + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +728,13 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +#ifdef DAT_EXTENSIONS +#include +typedef DAT_RETURN (*DAT_EXTENSION_FUNC) ( + IN DAT_HANDLE, /* dat handle */ + IN DAT_DTO_EXTENSION_OP, /* extension operation */ + IN va_list ); /* va_list */ +#endif + + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_error.h =================================================================== --- dat/include/dat/dat_error.h (revision 5366) +++ dat/include/dat/dat_error.h (working copy) @@ -1,31 +1,51 @@ /* - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2004, Network Appliance, Inc. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0". The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is in the file - * LICENSE2.txt in the root directory. 
The license is also available from - * the Open Source Initiative, see + * + * OR + * + * 2) under the terms of the "The BSD License". The license is also available + * from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The - * license is also available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are + * met: + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation + * notice, either one of the license notices in the documentation * and/or other materials provided with the distribution. + * + * Neither the name of Network Appliance, Inc. nor the names of other DAT + * Collaborative contributors may be used to endorse or promote + * products derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND + * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL + * THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. */ /*********************************************************** @@ -47,17 +67,15 @@ /* * - * All return codes are actually a 3-way tuple: - * - * type: DAT_RETURN_CLASS DAT_RETURN_TYPE DAT_RETURN_SUBTYPE - * bits: 31-30 29-16 15-0 + * All return codes are actually a 4-way tuple: * - * 3 2 1 - * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - * | C | DAT_RETURN_TYPE | DAT_RETURN_SUBTYPE | - * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + * type: CLASS RETURN_TYPE EXTENSION_SUBTYPE SUBTYPE + * bits: 31-30 29-16 15-8 7-0 * + * +-------------------------------------------------------------------------+ + * |3130 | 2928272625242322212019181716 | 15141312111009080 | 706054003020100| + * |CLAS | DAT_TYPE_STATUS | EXTENSION_SUBTYPE | SUBTYPE | + * +-------------------------------------------------------------------------+ */ /* @@ -70,8 +88,13 @@ * DAT Error bits */ #define DAT_TYPE_MASK 0x3fff0000 /* mask for DAT_TYPE_STATUS bits */ -#define DAT_SUBTYPE_MASK 0x0000FFFF /* mask for DAT_SUBTYPE_STATUS bits */ +#define DAT_SUBTYPE_MASK 0x000000FF /* mask for DAT_SUBTYPE_STATUS bits */ +#ifdef DAT_EXTENSIONS +/* Mask and macro for new extension subtype bits */ +#define DAT_EXTENSION_SUBTYPE_MASK 0x0000FF00 /* mask for DAT_EXTENSION_SUBTYPE_STATUS bits */ +#define DAT_GET_EXTENSION_SUBTYPE(status) ((DAT_UINT32)(status) & 
DAT_EXTENSION_SUBTYPE_MASK) +#endif /* * Determining the success of an operation is best done with a macro; * each of these returns a boolean value. Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 5366) +++ dat/include/dat/udat_redirection.h (working copy) @@ -199,13 +199,12 @@ typedef DAT_RETURN (*DAT_EVD_SET_UNWAITA typedef DAT_RETURN (*DAT_EVD_CLEAR_UNWAITABLE_FUNC) ( IN DAT_EVD_HANDLE); /* evd_handle */ - #include struct dat_provider { const char * device_name; - DAT_PVOID extension; + DAT_PVOID extension; DAT_IA_OPEN_FUNC ia_open_func; DAT_IA_QUERY_FUNC ia_query_func; @@ -294,6 +293,12 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + +#ifdef DAT_EXTENSIONS + /* udat-2.0 extensions */ + DAT_EXTENSION_FUNC extension_func; +#endif + }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 5366) +++ dat/include/dat/dat.h (working copy) @@ -119,6 +119,23 @@ typedef DAT_HANDLE DAT_RMR_HANDLE; typedef DAT_HANDLE DAT_RSP_HANDLE; typedef DAT_HANDLE DAT_SRQ_HANDLE; +/* PROTOTYPE: immediate data and extensions */ +typedef enum dat_dtos +{ + DAT_SEND, + DAT_SEND_IMMED, + DAT_RDMA_WRITE, + DAT_RDMA_WRITE_IMMED, + DAT_RDMA_READ, + DAT_RECEIVE, + DAT_RECEIVE_INVALIDATE, + DAT_RECEIVE_RDMA_WRITE_IMMED, + DAT_BIND_MW, +#ifdef DAT_EXTENSIONS + DAT_EXTENSION, +#endif +} DAT_DTOS; + /* dat NULL handles */ #define DAT_HANDLE_NULL ((DAT_HANDLE)NULL) @@ -259,7 +276,6 @@ typedef struct dat_rmr_triplet */ /* Memory privileges */ - typedef enum dat_mem_priv_flags { DAT_MEM_PRIV_NONE_FLAG = 0x00, @@ -267,7 +283,11 @@ typedef enum dat_mem_priv_flags DAT_MEM_PRIV_REMOTE_READ_FLAG = 0x02, DAT_MEM_PRIV_LOCAL_WRITE_FLAG = 0x10, DAT_MEM_PRIV_REMOTE_WRITE_FLAG = 0x20, - 
DAT_MEM_PRIV_ALL_FLAG = 0x33 + DAT_MEM_PRIV_MW_BIND_FLAG = 0x40, + DAT_MEM_PRIV_ALL_FLAG = 0x73, +#ifdef DAT_EXTENSIONS + DAT_MEM_PRIV_EXTENSION = 0x1000 +#endif } DAT_MEM_PRIV_FLAGS; /* For backward compatibility with DAT-1.0, memory privileges values are @@ -712,14 +732,23 @@ typedef enum dat_dto_completion_status /* Completion group structs (six total) */ -/* DTO completion event data */ +#ifdef DAT_EXTENSIONS +#include +#endif + +/* DTO completion event data, 2.0 update */ /* transfered_length is not defined if status is not DAT_SUCCESS */ typedef struct dat_dto_completion_event_data { DAT_EP_HANDLE ep_handle; DAT_DTO_COOKIE user_cookie; DAT_DTO_COMPLETION_STATUS status; - DAT_VLEN transfered_length; + DAT_VLEN transfered_length; + DAT_DTOS operation; + DAT_RMR_CONTEXT rmr_context; +#ifdef DAT_EXTENSIONS + DAT_DTO_EXTENSION_EVENT_DATA extension; +#endif } DAT_DTO_COMPLETION_EVENT_DATA; /* RMR bind completion event data */ @@ -854,11 +883,11 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, + } DAT_EVENT_NUMBER; /* Union for event Data */ - typedef union dat_event_data { DAT_DTO_COMPLETION_EVENT_DATA dto_completion_event_data; @@ -1222,6 +1251,13 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAT_EXTENSIONS +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_DTO_EXTENSION_OP, + IN ... ); +#endif + /* * DAT registry functions. * Index: dat/common/dat_api.c =================================================================== --- dat/common/dat_api.c (revision 5366) +++ dat/common/dat_api.c (working copy) @@ -2,27 +2,27 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. 
* * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * + * * 2) under the terms of the "The BSD License" a copy of which is in the file * LICENSE2.txt in the root directory. The license is also available from * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a * copy of which is in the file LICENSE3.txt in the root directory. The * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -35,7 +35,7 @@ * PURPOSE: DAT Provider and Consumer registry functions. 
* Also provide small integers for IA_HANDLES * - * $Id: dat_api.c,v 1.10 2005/05/20 22:25:31 jlentini Exp $ + * $Id: dat_api.c,v 1.5 2005/02/17 19:36:23 jlentini Exp $ **********************************************************************/ #include "dat_osd.h" @@ -70,15 +70,16 @@ dats_handle_vector_init ( void ) { DAT_RETURN dat_status; int i; + int status; dat_status = DAT_SUCCESS; g_hv.handle_max = DAT_HANDLE_ENTRY_STEP; - dat_status = dat_os_lock_init (&g_hv.handle_lock); - if ( DAT_SUCCESS != dat_status ) + status = dat_os_lock_init (&g_hv.handle_lock); + if ( DAT_SUCCESS != status ) { - return dat_status; + return status; } g_hv.handle_array = dat_os_alloc (sizeof(void *) * DAT_HANDLE_ENTRY_STEP); @@ -88,7 +89,7 @@ dats_handle_vector_init ( void ) goto bail; } - for (i = 0; i < g_hv.handle_max; i++) + for (i = g_hv.handle_max; i < g_hv.handle_max; i++) { g_hv.handle_array[i] = NULL; } @@ -112,11 +113,7 @@ dats_set_ia_handle ( void **h; dat_os_lock (&g_hv.handle_lock); - - /* - * Don't give out handle zero since that is DAT_HANDLE_NULL! - */ - for (i = 1; i < g_hv.handle_max; i++) + for (i = 0; i < g_hv.handle_max; i++) { if (g_hv.handle_array[i] == NULL) { @@ -1142,6 +1139,39 @@ DAT_RETURN dat_srq_set_lw( low_watermark); } +#ifdef DAT_EXTENSIONS +DAT_RETURN dat_extension( + IN DAT_HANDLE handle, + IN DAT_DTO_EXTENSION_OP ext_op, + IN ... 
) + +{ + DAT_RETURN status; + va_list args; + + if (handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + + /* verify provider extension support */ + if (!dat_extensions) + { + return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0); + } + + va_start(args, ext_op); + + status = DAT_EXTENSION(handle, + ext_op, + args); + va_end(args); + + return status; +} +#endif + + /* * Local variables: * c-indent-level: 4 Index: dat/udat/Makefile =================================================================== --- dat/udat/Makefile (revision 5366) +++ dat/udat/Makefile (working copy) @@ -112,6 +112,12 @@ CFLAGS32 = -m32 endif # +# Prototype 2.0 DAT extensions +# +CFLAGS += -DDAT_EXTENSIONS + + +# # LD definitions # Index: dat/udat/udat.c =================================================================== --- dat/udat/udat.c (revision 5366) +++ dat/udat/udat.c (working copy) @@ -2,27 +2,27 @@ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is * in the file LICENSE.txt in the root directory. The license is also - * available from the Open Source Initiative, see + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * + * * 2) under the terms of the "The BSD License" a copy of which is in the file * LICENSE2.txt in the root directory. The license is also available from * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is in the file LICENSE3.txt in the root directory. The + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. 
The * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -34,7 +34,7 @@ * * PURPOSE: DAT Provider and Consumer registry functions. * - * $Id: udat.c,v 1.22 2005/03/24 05:58:35 jlentini Exp $ + * $Id: udat.c,v 1.20 2005/02/11 20:17:05 jlentini Exp $ **********************************************************************/ #include @@ -66,6 +66,10 @@ udat_check_state ( void ); * * *********************************************************************/ +/* + * Use a global to get an unresolved when run with pre-extension library + */ +int dat_extensions = 0; /* * @@ -226,17 +230,48 @@ dat_ia_openv ( return dat_status; } - dat_status = (*ia_open_func) (name, - async_event_qlen, - async_event_handle, - ia_handle); + dat_status = (*ia_open_func) (name, + async_event_qlen, + async_event_handle, + ia_handle); + + /* + * See if provider supports extensions + */ if (dat_status == DAT_SUCCESS) { + DAT_PROVIDER_ATTR p_attr; + int i; + return_handle = dats_set_ia_handle (*ia_handle); if (return_handle >= 0) { *ia_handle = (DAT_IA_HANDLE)return_handle; - } + } + + if ( dat_ia_query( *ia_handle, + NULL, + 0, + NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &p_attr ) == DAT_SUCCESS ) + { + for ( i = 0; i < p_attr.num_provider_specific_attr; i++ ) + { + if ( (strcmp( p_attr.provider_specific_attr[i].name, + "DAT_EXTENSION_INTERFACE" ) == 0) && + (strcmp( p_attr.provider_specific_attr[i].value, + "TRUE" ) == 0) ) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API, + "DAT Registry: dat_ia_open () " + "DAPL Extension Interface 
supported!\n"); + + dat_extensions = 1; + break; + } + } + } } return dat_status; From mshefty at ichips.intel.com Fri Feb 10 15:59:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Feb 2006 15:59:01 -0800 Subject: [openib-general] Re: [PATCH v3] NOARP Device Support in the CMA In-Reply-To: References: Message-ID: <43ED28C5.2070502@ichips.intel.com> Sean Hefty wrote: > This is yet another updated patch that releases the reference on the acquired > route. I've committed this version of the patch. Thanks. - Sean From mshefty at ichips.intel.com Fri Feb 10 16:06:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Feb 2006 16:06:23 -0800 Subject: [openib-general] [PATCH v2] [RFC] - example user mode rdma ping/pongprogram using CMA In-Reply-To: References: Message-ID: <43ED2A7F.4090101@ichips.intel.com> Hefty, Sean wrote: > Here's an updated version of rping. I restructured the code to make it > more modular, reduce the size of some of the functions, simplify some > areas, and make it more consistent. The updated version worked for my > limited testing. Please review the changes to see if I changed any of > the intended functionality. I've committed this version of the patch. Thanks. - Sean From rolandd at cisco.com Fri Feb 10 16:51:58 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 16:51:58 -0800 Subject: [openib-general] [PATCH 1/4] [RFC] Add ib_modify_qp_is_ok to core In-Reply-To: <20062101651.zETYyw9qncSeTl9W@cisco.com> Message-ID: <20062101651.JheWN0O5tEPAuWNc@cisco.com> Add ib_modify_qp_is_ok() to the IB midlayer.
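For readers following along, here is a minimal user-space sketch of the kind of table-driven check the patch sets up: a [current state][next state] table records whether a transition is valid and which attribute bits are required or optional for each QP type. The body of ib_modify_qp_is_ok() is not visible in this part of the message, so the checking logic is an assumption. Every name below (qp_state, qp_type, ATTR_*, modify_qp_is_ok) is a simplified, hypothetical stand-in rather than the kernel definition, and the table is cut down to a few entries.

```c
/* Hypothetical stand-ins for the kernel enums; values are illustrative only. */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_MAX };
enum qp_type  { QPT_UD, QPT_RC, QPT_MAX };

enum attr_mask {
	ATTR_STATE      = 1 << 0,
	ATTR_PKEY_INDEX = 1 << 1,
	ATTR_PORT       = 1 << 2,
	ATTR_QKEY       = 1 << 3,
	ATTR_ACCESS     = 1 << 4,
	ATTR_AV         = 1 << 5,
	ATTR_PATH_MTU   = 1 << 6,
};

/* One entry per (current state, next state) pair. */
struct transition {
	int valid;                /* is this transition allowed at all?  */
	int req_param[QPT_MAX];   /* bits that must be set, per QP type  */
	int opt_param[QPT_MAX];   /* bits that may be set, per QP type   */
};

/* Entries not listed are zero-initialized, i.e. invalid transitions. */
static const struct transition table[QPS_MAX][QPS_MAX] = {
	[QPS_RESET] = {
		[QPS_RESET] = { .valid = 1 },
		[QPS_INIT]  = {
			.valid = 1,
			.req_param = {
				[QPT_UD] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_QKEY,
				[QPT_RC] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_ACCESS,
			},
		},
	},
	[QPS_INIT] = {
		[QPS_RESET] = { .valid = 1 },
		[QPS_RTR]   = {
			.valid = 1,
			.req_param = { [QPT_RC] = ATTR_AV | ATTR_PATH_MTU },
			.opt_param = {
				[QPT_UD] = ATTR_PKEY_INDEX | ATTR_QKEY,
				[QPT_RC] = ATTR_ACCESS | ATTR_PKEY_INDEX,
			},
		},
	},
};

/*
 * Return 1 if the transition is valid, every required bit is present,
 * and no bit outside required | optional | ATTR_STATE is set.
 * This logic is assumed, not copied from the patch.
 */
static int modify_qp_is_ok(enum qp_state cur, enum qp_state next,
			   enum qp_type type, int mask)
{
	const struct transition *t = &table[cur][next];
	int req, opt;

	if (!t->valid)
		return 0;

	req = t->req_param[type];
	opt = t->opt_param[type];

	if ((mask & req) != req)
		return 0;	/* a required attribute is missing */

	if (mask & ~(req | opt | ATTR_STATE))
		return 0;	/* an attribute not allowed here is set */

	return 1;
}
```

With a table like this, a driver would presumably call the check once at the top of its modify-QP path instead of carrying its own copy of the transition rules.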
--- infiniband/core/verbs.c (revision 5364) +++ infiniband/core/verbs.c (working copy) @@ -244,6 +244,266 @@ struct ib_qp *ib_create_qp(struct ib_pd } EXPORT_SYMBOL(ib_create_qp); +static const struct { + int valid; + enum ib_qp_attr_mask req_param[IB_QPT_RAW_ETY + 1]; + enum ib_qp_attr_mask opt_param[IB_QPT_RAW_ETY + 1]; +} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 }, + [IB_QPS_INIT] = { + .valid = 1, + .req_param = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 }, + [IB_QPS_INIT] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .valid = 1, + .req_param = { + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [IB_QPT_RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] 
= { .valid = 1 }, + [IB_QPS_RTS] = { + .valid = 1, + .req_param = { + [IB_QPT_UD] = IB_QP_SQ_PSN, + [IB_QPT_UC] = IB_QP_SQ_PSN, + [IB_QPT_RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [IB_QPT_SMI] = IB_QP_SQ_PSN, + [IB_QPT_GSI] = IB_QP_SQ_PSN, + }, + .opt_param = { + [IB_QPT_UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [IB_QPT_SMI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 }, + [IB_QPS_RTS] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [IB_QPT_SMI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [IB_QPT_UC] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [IB_QPT_RC] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [IB_QPT_SMI] = IB_QP_EN_SQD_ASYNC_NOTIFY, + [IB_QPT_GSI] = IB_QP_EN_SQD_ASYNC_NOTIFY + } + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 }, + [IB_QPS_RTS] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [IB_QPT_SMI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | + 
IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 }, + [IB_QPS_RTS] = { + .valid = 1, + .opt_param = { + [IB_QPT_UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ACCESS_FLAGS), + [IB_QPT_SMI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .valid = 1 }, + [IB_QPS_ERR] = { .valid = 1 } + } +}; + +int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state, + enum ib_qp_type type, enum ib_qp_attr_mask mask) +{ + enum ib_qp_attr_mask req_param, opt_param; + + if (cur_state < 0 || cur_state > IB_QPS_ERR || + next_state < 0 || next_state > IB_QPS_ERR) + return 0; + + if (mask & IB_QP_CUR_STATE && + cur_state != IB_QPS_RTR && cur_state != IB_QPS_RTS && + cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE) + return 0; + + if (!qp_state_table[cur_state][next_state].valid) { + printk(KERN_ERR "invalid trans %d -> %d\n", + cur_state, next_state); + return 0; + } + + req_param = qp_state_table[cur_state][next_state].req_param[type]; + opt_param = qp_state_table[cur_state][next_state].opt_param[type]; + + if ((mask & req_param) != req_param) { + printk(KERN_ERR "missing params %08x/%08x\n", + mask, req_param); + return 0; + } + + if (mask & ~(req_param | opt_param 
| IB_QP_STATE)) { + printk(KERN_ERR "extra params %08x/%08x\n", + mask, ~(req_param | opt_param | IB_QP_STATE)); + return 0; + } + + return 1; +} +EXPORT_SYMBOL(ib_modify_qp_is_ok); + int ib_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask) --- infiniband/include/rdma/ib_verbs.h (revision 5364) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -1012,6 +1012,24 @@ static inline int ib_copy_to_udata(struc return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0; } +/** + * ib_modify_qp_is_ok - Check that the supplied attribute mask + * contains all required attributes and no attributes not allowed for + * the given QP state transition. + * @cur_state: Current QP state + * @next_state: Next QP state + * @type: QP type + * @mask: Mask of supplied QP attributes + * + * This function is a helper function that a low-level driver's + * modify_qp method can use to validate the consumer's input. It + * checks that cur_state and next_state are valid QP states, that a + * transition from cur_state to next_state is allowed by the IB spec, + * and that the attribute mask supplied is allowed for the transition. + */ +int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state, + enum ib_qp_type type, enum ib_qp_attr_mask mask); + int ib_register_event_handler (struct ib_event_handler *event_handler); int ib_unregister_event_handler(struct ib_event_handler *event_handler); void ib_dispatch_event(struct ib_event *event); From rolandd at cisco.com Fri Feb 10 16:51:58 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 16:51:58 -0800 Subject: [openib-general] [PATCH 2/4] [RFC] Use ib_modify_qp_is_ok in mthca In-Reply-To: <20062101651.JheWN0O5tEPAuWNc@cisco.com> Message-ID: <20062101651.5GOt73XX2DS6Q2Ot@cisco.com> Convert mthca to use ib_modify_qp_is_ok() instead of its own QP transition table. 
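The mthca conversion replaces the driver's private transition enum with a two-dimensional opcode table indexed by current and next state. A reduced sketch of that table technique in plain C follows. The ST_* states, OP_* opcodes and lookup_op() are hypothetical names chosen for illustration. Unlike the patch, which relies on ib_modify_qp_is_ok() having validated the transition before the table is consulted, the sketch range-checks its arguments and treats a zero entry as "no such transition".

```c
/* Hypothetical states and command opcodes, for illustration only. */
enum state { ST_RESET, ST_INIT, ST_RTR, ST_ERR, ST_COUNT };

enum opcode { OP_NONE = 0, OP_ANY2RST, OP_ANY2ERR, OP_RST2INIT, OP_INIT2RTR };

/* Entries not named by a designated initializer are zero, i.e. OP_NONE. */
static const enum opcode op_table[ST_COUNT][ST_COUNT] = {
	[ST_RESET] = {
		[ST_RESET] = OP_ANY2RST,
		[ST_ERR]   = OP_ANY2ERR,
		[ST_INIT]  = OP_RST2INIT,
	},
	[ST_INIT] = {
		[ST_RESET] = OP_ANY2RST,
		[ST_ERR]   = OP_ANY2ERR,
		[ST_RTR]   = OP_INIT2RTR,
	},
	[ST_RTR] = {
		[ST_RESET] = OP_ANY2RST,
		[ST_ERR]   = OP_ANY2ERR,
	},
	[ST_ERR] = {
		[ST_RESET] = OP_ANY2RST,
		[ST_ERR]   = OP_ANY2ERR,
	},
};

/* Return the opcode for cur -> next, or OP_NONE if there is none. */
static enum opcode lookup_op(int cur, int next)
{
	if (cur < 0 || cur >= ST_COUNT || next < 0 || next >= ST_COUNT)
		return OP_NONE;
	return op_table[cur][next];
}
```

Keeping the table indexed by the generic state enum means the caller no longer needs a driver-specific transition code; the pair (cur, next) is enough to find the command.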
--- infiniband/hw/mthca/mthca_cmd.c (revision 5364) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1566,31 +1566,56 @@ int mthca_ARM_SRQ(struct mthca_dev *dev, CMD_TIME_CLASS_B, status); } -int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, - int is_ee, struct mthca_mailbox *mailbox, u32 optmask, +int mthca_MODIFY_QP(struct mthca_dev *dev, enum ib_qp_state cur, + enum ib_qp_state next, u32 num, int is_ee, + struct mthca_mailbox *mailbox, u32 optmask, u8 *status) { - static const u16 op[] = { - [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, - [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, - [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, - [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, - [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, - [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, - [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, - [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, - [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, - [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, - [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + static const u16 op[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_INIT] = CMD_RST2INIT_QPEE, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_INIT] = CMD_INIT2INIT_QPEE, + [IB_QPS_RTR] = CMD_INIT2RTR_QPEE, + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_RTS] = CMD_RTR2RTS_QPEE, + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_RTS] = CMD_RTS2RTS_QPEE, + [IB_QPS_SQD] = CMD_RTS2SQD_QPEE, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_RTS] = CMD_SQD2RTS_QPEE, + [IB_QPS_SQD] = CMD_SQD2SQD_QPEE, + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + [IB_QPS_RTS] = CMD_SQERR2RTS_QPEE, + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = 
CMD_ERR2RST_QPEE, + [IB_QPS_ERR] = CMD_2ERR_QPEE, + } }; + u8 op_mod = 0; int my_mailbox = 0; int err; - if (trans < 0 || trans >= ARRAY_SIZE(op)) - return -EINVAL; - - if (trans == MTHCA_TRANS_ANY2RST) { + if (op[cur][next] == CMD_ERR2RST_QPEE) { op_mod = 3; /* don't write outbox, any->reset */ /* For debugging */ @@ -1602,34 +1627,35 @@ int mthca_MODIFY_QP(struct mthca_dev *de } else mailbox = NULL; } - } else { - if (0) { + + err = mthca_cmd_box(dev, 0, mailbox ? mailbox->dma : 0, + (!!is_ee << 24) | num, op_mod, + op[cur][next], CMD_TIME_CLASS_C, status); + + if (0 && mailbox) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" opt param mask: %08x\n", be32_to_cpup(mailbox->buf)); + printk(" %08x\n", be32_to_cpup(mailbox->buf)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) - printk(" [%02x] ", i * 4); + printk("[%02x] ", i * 4); printk(" %08x", be32_to_cpu(((__be32 *) mailbox->buf)[i + 2])); if ((i + 1) % 8 == 0) printk("\n"); } } - } - if (trans == MTHCA_TRANS_ANY2RST) { - err = mthca_cmd_box(dev, 0, mailbox ? 
mailbox->dma : 0, - (!!is_ee << 24) | num, op_mod, - op[trans], CMD_TIME_CLASS_C, status); - - if (0 && mailbox) { + if (my_mailbox) + mthca_free_mailbox(dev, mailbox); + } else { + if (0) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" %08x\n", be32_to_cpup(mailbox->buf)); + printk(" opt param mask: %08x\n", be32_to_cpup(mailbox->buf)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) - printk("[%02x] ", i * 4); + printk(" [%02x] ", i * 4); printk(" %08x", be32_to_cpu(((__be32 *) mailbox->buf)[i + 2])); if ((i + 1) % 8 == 0) @@ -1637,13 +1663,9 @@ int mthca_MODIFY_QP(struct mthca_dev *de } } - } else - err = mthca_cmd(dev, mailbox->dma, - optmask | (!!is_ee << 24) | num, - op_mod, op[trans], CMD_TIME_CLASS_C, status); - - if (my_mailbox) - mthca_free_mailbox(dev, mailbox); + err = mthca_cmd(dev, mailbox->dma, optmask | (!!is_ee << 24) | num, + op_mod, op[cur][next], CMD_TIME_CLASS_C, status); + } return err; } --- infiniband/hw/mthca/mthca_cmd.h (revision 5364) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -306,8 +306,9 @@ int mthca_SW2HW_SRQ(struct mthca_dev *de int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); -int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, - int is_ee, struct mthca_mailbox *mailbox, u32 optmask, +int mthca_MODIFY_QP(struct mthca_dev *dev, enum ib_qp_state cur, + enum ib_qp_state next, u32 num, int is_ee, + struct mthca_mailbox *mailbox, u32 optmask, u8 *status); int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, struct mthca_mailbox *mailbox, u8 *status); --- infiniband/hw/mthca/mthca_qp.c (revision 5364) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -284,213 +284,6 @@ static int to_mthca_st(int transport) } } -static const struct { - int trans; - u32 req_param[NUM_TRANS]; - u32 opt_param[NUM_TRANS]; -} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { - 
[IB_QPS_RESET] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_INIT] = { - .trans = MTHCA_TRANS_RST2INIT, - .req_param = { - [UD] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY), - [UC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [RC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [MLX] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - }, - /* bug-for-bug compatibility with VAPI: */ - .opt_param = { - [MLX] = IB_QP_PORT - } - }, - }, - [IB_QPS_INIT] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_INIT] = { - .trans = MTHCA_TRANS_INIT2INIT, - .opt_param = { - [UD] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY), - [UC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [RC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [MLX] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - } - }, - [IB_QPS_RTR] = { - .trans = MTHCA_TRANS_INIT2RTR, - .req_param = { - [UC] = (IB_QP_AV | - IB_QP_PATH_MTU | - IB_QP_DEST_QPN | - IB_QP_RQ_PSN), - [RC] = (IB_QP_AV | - IB_QP_PATH_MTU | - IB_QP_DEST_QPN | - IB_QP_RQ_PSN | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_MIN_RNR_TIMER), - }, - .opt_param = { - [UD] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [UC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX), - [RC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX), - [MLX] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - } - } - }, - [IB_QPS_RTR] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = MTHCA_TRANS_RTR2RTS, - .req_param = { - [UD] = IB_QP_SQ_PSN, - [UC] = IB_QP_SQ_PSN, - [RC] = (IB_QP_TIMEOUT | - IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY | - IB_QP_SQ_PSN | - IB_QP_MAX_QP_RD_ATOMIC), - [MLX] = IB_QP_SQ_PSN, - }, - .opt_param = { - [UD] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - [UC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - 
IB_QP_ACCESS_FLAGS | - IB_QP_PATH_MIG_STATE), - [RC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - [MLX] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - } - } - }, - [IB_QPS_RTS] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = MTHCA_TRANS_RTS2RTS, - .opt_param = { - [UD] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - [UC] = (IB_QP_ACCESS_FLAGS | - IB_QP_ALT_PATH | - IB_QP_PATH_MIG_STATE), - [RC] = (IB_QP_ACCESS_FLAGS | - IB_QP_ALT_PATH | - IB_QP_PATH_MIG_STATE | - IB_QP_MIN_RNR_TIMER), - [MLX] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - } - }, - [IB_QPS_SQD] = { - .trans = MTHCA_TRANS_RTS2SQD, - .opt_param = { - [UD] = IB_QP_EN_SQD_ASYNC_NOTIFY, - [UC] = IB_QP_EN_SQD_ASYNC_NOTIFY, - [RC] = IB_QP_EN_SQD_ASYNC_NOTIFY, - [MLX] = IB_QP_EN_SQD_ASYNC_NOTIFY - } - }, - }, - [IB_QPS_SQD] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = MTHCA_TRANS_SQD2RTS, - .opt_param = { - [UD] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - [UC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PATH_MIG_STATE), - [RC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - [MLX] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - } - }, - [IB_QPS_SQD] = { - .trans = MTHCA_TRANS_SQD2SQD, - .opt_param = { - [UD] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [UC] = (IB_QP_AV | - IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_PATH_MIG_STATE), - [RC] = (IB_QP_AV | - IB_QP_TIMEOUT | - IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY | - IB_QP_MAX_QP_RD_ATOMIC | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - [MLX] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - } - } - }, - [IB_QPS_SQE] = { - 
[IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = MTHCA_TRANS_SQERR2RTS, - .opt_param = { - [UD] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - [UC] = (IB_QP_CUR_STATE | - IB_QP_ACCESS_FLAGS), - [MLX] = (IB_QP_CUR_STATE | - IB_QP_QKEY), - } - } - }, - [IB_QPS_ERR] = { - [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } - } -}; - static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, int attr_mask) { @@ -580,19 +373,12 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct mthca_mailbox *mailbox; struct mthca_qp_param *qp_param; struct mthca_qp_context *qp_context; - u32 req_param, opt_param; u32 sqd_event = 0; u8 status; int err; if (attr_mask & IB_QP_CUR_STATE) { - if (attr->cur_qp_state != IB_QPS_RTR && - attr->cur_qp_state != IB_QPS_RTS && - attr->cur_qp_state != IB_QPS_SQD && - attr->cur_qp_state != IB_QPS_SQE) - return -EINVAL; - else - cur_state = attr->cur_qp_state; + cur_state = attr->cur_qp_state; } else { spin_lock_irq(&qp->sq.lock); spin_lock(&qp->rq.lock); @@ -601,37 +387,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, spin_unlock_irq(&qp->sq.lock); } - if (attr_mask & IB_QP_STATE) { - if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) - return -EINVAL; - new_state = attr->qp_state; - } else - new_state = cur_state; - - if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { - mthca_dbg(dev, "Illegal QP transition " - "%d->%d\n", cur_state, new_state); - return -EINVAL; - } - - req_param = state_table[cur_state][new_state].req_param[qp->transport]; - opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; - - if ((req_param & attr_mask) != req_param) { - mthca_dbg(dev, "QP transition " - "%d->%d missing req attr 0x%08x\n", - cur_state, new_state, - req_param & ~attr_mask); - return -EINVAL; - } + new_state = attr_mask & IB_QP_STATE ? 
attr->qp_state : cur_state; - if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { - mthca_dbg(dev, "QP transition (transport %d) " - "%d->%d has extra attr 0x%08x\n", - qp->transport, - cur_state, new_state, - attr_mask & ~(req_param | opt_param | - IB_QP_STATE)); + if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) { + mthca_dbg(dev, "Bad QP transition (transport %d) " + "%d->%d with attr 0x%08x\n", + qp->transport, cur_state, new_state, + attr_mask); return -EINVAL; } @@ -851,11 +613,11 @@ int mthca_modify_qp(struct ib_qp *ibqp, attr->en_sqd_async_notify) sqd_event = 1 << 31; - err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, - qp->qpn, 0, mailbox, sqd_event, &status); + err = mthca_MODIFY_QP(dev, cur_state, new_state, qp->qpn, 0, + mailbox, sqd_event, &status); if (status) { - mthca_warn(dev, "modify QP %d returned status %02x.\n", - state_table[cur_state][new_state].trans, status); + mthca_warn(dev, "modify QP %d->%d returned status %02x.\n", + cur_state, new_state, status); err = -EINVAL; } @@ -1403,7 +1165,8 @@ void mthca_free_qp(struct mthca_dev *dev wait_event(qp->wait, !atomic_read(&qp->refcount)); if (qp->state != IB_QPS_RESET) - mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + mthca_MODIFY_QP(dev, qp->state, IB_QPS_RESET, qp->qpn, 0, + NULL, 0, &status); /* * If this is a userspace QP, the buffers, MR, CQs and so on From rolandd at cisco.com Fri Feb 10 16:51:58 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 16:51:58 -0800 Subject: [openib-general] [PATCH 3/4] [RFC] Use ib_modify_qp_is_ok in ipath In-Reply-To: <20062101651.5GOt73XX2DS6Q2Ot@cisco.com> Message-ID: <20062101651.TEgri0fV1I6PJfvx@cisco.com> Convert ipath to use ib_modify_qp_is_ok() instead of its own QP transition table. Lightly tested. PathScale people: please test this out and apply if it's OK. 
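Both driver conversions in this series resolve the current and next state from the attribute mask before calling the helper. A small stand-alone sketch of that resolution step, under stated assumptions: MASK_STATE, MASK_CUR_STATE, struct attr, struct transition and resolve() are hypothetical names, and plain ints stand in for enum ib_qp_state.

```c
/* Hypothetical mask bits and attribute struct, for illustration only. */
enum { MASK_STATE = 1 << 0, MASK_CUR_STATE = 1 << 1 };

struct attr {
	int qp_state;		/* requested next state */
	int cur_qp_state;	/* caller's view of the current state */
};

struct transition {
	int cur;
	int next;
};

/*
 * Pick the current state from the caller if MASK_CUR_STATE is set,
 * else from the stored QP state; pick the next state from the caller
 * if MASK_STATE is set, else stay in the current state.
 */
static struct transition resolve(int mask, const struct attr *attr,
				 int stored_state)
{
	struct transition t;

	/* `&` binds tighter than `?:`, so this is (mask & FLAG) ? a : b. */
	t.cur  = mask & MASK_CUR_STATE ? attr->cur_qp_state : stored_state;
	t.next = mask & MASK_STATE ? attr->qp_state : t.cur;
	return t;
}
```

Note that each field is selected by its own flag: the next state depends on MASK_STATE, not on MASK_CUR_STATE.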
--- infiniband/hw/ipath/ipath_verbs.c (revision 5364) +++ infiniband/hw/ipath/ipath_verbs.c (working copy) @@ -1003,213 +1003,6 @@ static void send_complete(unsigned long } /* - * This is the QP state transition table. - * See ipath_modify_qp() for details. - */ -static const struct { - int trans; - u32 req_param[IB_QPT_RAW_IPV6]; - u32 opt_param[IB_QPT_RAW_IPV6]; -} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { - [IB_QPS_RESET] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_INIT] = { - .trans = IPATH_TRANS_RST2INIT, - .req_param = { - [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [IB_QPT_RC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - }, - }, - }, - [IB_QPS_INIT] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_INIT] = { - .trans = IPATH_TRANS_INIT2INIT, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [IB_QPT_RC] = (IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - } - }, - [IB_QPS_RTR] = { - .trans = IPATH_TRANS_INIT2RTR, - .req_param = { - [IB_QPT_UC] = (IB_QP_AV | - IB_QP_PATH_MTU | - IB_QP_DEST_QPN | - IB_QP_RQ_PSN), - [IB_QPT_RC] = (IB_QP_AV | - IB_QP_PATH_MTU | - IB_QP_DEST_QPN | - IB_QP_RQ_PSN | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_MIN_RNR_TIMER), - }, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - 
IB_QP_PKEY_INDEX), - [IB_QPT_RC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX), - } - } - }, - [IB_QPS_RTR] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = IPATH_TRANS_RTR2RTS, - .req_param = { - [IB_QPT_SMI] = IB_QP_SQ_PSN, - [IB_QPT_GSI] = IB_QP_SQ_PSN, - [IB_QPT_UD] = IB_QP_SQ_PSN, - [IB_QPT_UC] = IB_QP_SQ_PSN, - [IB_QPT_RC] = (IB_QP_TIMEOUT | - IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY | - IB_QP_SQ_PSN | - IB_QP_MAX_QP_RD_ATOMIC), - }, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_PATH_MIG_STATE), - [IB_QPT_RC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - } - } - }, - [IB_QPS_RTS] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = IPATH_TRANS_RTS2RTS, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS | - IB_QP_ALT_PATH | - IB_QP_PATH_MIG_STATE), - [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS | - IB_QP_ALT_PATH | - IB_QP_PATH_MIG_STATE | - IB_QP_MIN_RNR_TIMER), - } - }, - [IB_QPS_SQD] = { - .trans = IPATH_TRANS_RTS2SQD, - }, - }, - [IB_QPS_SQD] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = IPATH_TRANS_SQD2RTS, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | 
- IB_QP_PATH_MIG_STATE), - [IB_QPT_RC] = (IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - } - }, - [IB_QPS_SQD] = { - .trans = IPATH_TRANS_SQD2SQD, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), - [IB_QPT_UC] = (IB_QP_AV | - IB_QP_TIMEOUT | - IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_PATH_MIG_STATE), - [IB_QPT_RC] = (IB_QP_AV | - IB_QP_TIMEOUT | - IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY | - IB_QP_MAX_QP_RD_ATOMIC | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_CUR_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE), - } - } - }, - [IB_QPS_SQE] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, - [IB_QPS_RTS] = { - .trans = IPATH_TRANS_SQERR2RTS, - .opt_param = { - [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [IB_QPT_UC] = IB_QP_CUR_STATE, - [IB_QPT_RC] = (IB_QP_CUR_STATE | - IB_QP_MIN_RNR_TIMER), - } - } - }, - [IB_QPS_ERR] = { - [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, - [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR } - } -}; - -/* * Initialize the QP state to the reset state. 
*/ static void ipath_reset_qp(struct ipath_qp *qp) @@ -1340,55 +1133,30 @@ static int ipath_modify_qp(struct ib_qp { struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; - u32 req_param, opt_param; unsigned long flags; - if (attr_mask & IB_QP_CUR_STATE) { - cur_state = attr->cur_qp_state; - if (cur_state != IB_QPS_RTR && - cur_state != IB_QPS_RTS && - cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE) - return -EINVAL; - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); - } else { - spin_lock_irqsave(&qp->r_rq.lock, flags); - spin_lock(&qp->s_lock); - cur_state = qp->state; - } + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); - if (attr_mask & IB_QP_STATE) { - new_state = attr->qp_state; - if (new_state < 0 || new_state > IB_QPS_ERR) - goto inval; - } else - new_state = cur_state; + cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state; + new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state; - switch (qp_state_table[cur_state][new_state].trans) { - case IPATH_TRANS_INVALID: + if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) goto inval; - case IPATH_TRANS_ANY2RST: + switch (new_state) { + case IB_QPS_RESET: ipath_reset_qp(qp); break; - case IPATH_TRANS_ANY2ERR: + case IB_QPS_ERR: ipath_error_qp(qp); break; + default: + break; } - req_param = - qp_state_table[cur_state][new_state].req_param[qp->ibqp.qp_type]; - opt_param = - qp_state_table[cur_state][new_state].opt_param[qp->ibqp.qp_type]; - - if ((req_param & attr_mask) != req_param) - goto inval; - - if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) - goto inval; - if (attr_mask & IB_QP_PKEY_INDEX) { struct ipath_ibdev *dev = to_idev(ibqp->device); From rolandd at cisco.com Fri Feb 10 16:51:58 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 16:51:58 -0800 Subject: [openib-general] [PATCH 4/4] [RFC] Use ib_modify_qp_is_ok in ehca In-Reply-To: 
<20062101651.TEgri0fV1I6PJfvx@cisco.com> Message-ID: <20062101651.pUO5xLBrNqM5Q2R8@cisco.com> Convert ehca to use ib_modify_qp_is_ok() instead of its own QP transition table. Compile tested only because of lack of hardware. (Well, actually IBM has kindly supplied a couple of systems, but the rack isn't quite set up). In any case, IBM people: please make sure I didn't break anything, and if the patch is OK, please apply. --- infiniband/hw/ehca/ehca_qp.c (revision 5364) +++ infiniband/hw/ehca/ehca_qp.c (working copy) @@ -178,223 +178,13 @@ static inline enum ehca_qp_type ib2ehcaq } } -/** @brief struct describes a state transition via modify qp - */ -struct ehca_modqp_statetrans { - enum ib_qp_statetrans statetrans; - int req_attr[QPT_MAX]; - int opt_attr[QPT_MAX]; -}; - -/** @brief state transition table used by modify qp - * the order is defined by transitions listed in enum ib_qp_statetrans - */ -static const struct ehca_modqp_statetrans modqp_statetrans_table[IB_QPST_MAX] = { - [IB_QPST_ANY2RESET] = {.statetrans = IB_QPST_ANY2RESET}, - [IB_QPST_ANY2ERR] = {.statetrans = IB_QPST_ANY2ERR}, - [IB_QPST_RESET2INIT] = { - .statetrans = IB_QPST_RESET2INIT, - .req_attr = { - [QPT_RC] = (IB_QP_STATE | - IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [QPT_UC] = (IB_QP_STATE | - IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_ACCESS_FLAGS), - [QPT_UD] = (IB_QP_STATE | - IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY), - [QPT_SQP] = (IB_QP_STATE | - IB_QP_PKEY_INDEX | - IB_QP_PORT | - IB_QP_QKEY) - } - }, - [IB_QPST_INIT2RTR] = { - .statetrans = IB_QPST_INIT2RTR, - .req_attr = { - [QPT_RC] = (IB_QP_STATE | - IB_QP_RQ_PSN | - IB_QP_PATH_MTU | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_DEST_QPN | - IB_QP_AV | - IB_QP_MIN_RNR_TIMER), - [QPT_UC] = (IB_QP_STATE | - IB_QP_RQ_PSN | - IB_QP_PATH_MTU | - IB_QP_DEST_QPN | - IB_QP_AV), - [QPT_UD] = (IB_QP_STATE), - [QPT_SQP] = (IB_QP_STATE) - }, - .opt_attr = { - [QPT_RC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX), 
- [QPT_UC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PKEY_INDEX), - [QPT_UD] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY), - [QPT_SQP] = (IB_QP_PKEY_INDEX | - IB_QP_QKEY) - } - }, - [IB_QPST_INIT2INIT] = { - .statetrans = IB_QPST_INIT2INIT, - .opt_attr = { - [QPT_RC] = (IB_QP_PKEY_INDEX | - IB_QP_ACCESS_FLAGS | - IB_QP_PORT), - [QPT_UC] = (IB_QP_PKEY_INDEX | - IB_QP_ACCESS_FLAGS | - IB_QP_PORT), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_PKEY_INDEX | - IB_QP_PORT), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_PKEY_INDEX | - IB_QP_PORT) - } - }, - [IB_QPST_RTR2RTS] = { - .statetrans = IB_QPST_RTR2RTS, - .req_attr = { - [QPT_RC] = (IB_QP_STATE | - IB_QP_SQ_PSN | - IB_QP_MAX_QP_RD_ATOMIC | - IB_QP_RNR_RETRY | - IB_QP_TIMEOUT | - IB_QP_RETRY_CNT), - [QPT_UC] = (IB_QP_STATE | - IB_QP_SQ_PSN), - [QPT_UD] = (IB_QP_STATE | - IB_QP_SQ_PSN), - [QPT_SQP] = (IB_QP_STATE | - IB_QP_SQ_PSN) - }, - .opt_attr = { - [QPT_RC] = (IB_QP_PATH_MIG_STATE | - IB_QP_ALT_PATH | - IB_QP_MIN_RNR_TIMER | - IB_QP_ACCESS_FLAGS | - IB_QP_CUR_STATE), - [QPT_UC] = (IB_QP_PATH_MIG_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_CUR_STATE), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_CUR_STATE), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_CUR_STATE) - } - }, - [IB_QPST_RTS2SQD] = { - .statetrans = IB_QPST_RTS2SQD, - .req_attr = { - [QPT_RC] = (IB_QP_STATE), - [QPT_UC] = (IB_QP_STATE), - [QPT_UD] = (IB_QP_STATE), - [QPT_SQP] = (IB_QP_STATE) - }, - }, - [IB_QPST_RTS2RTS] = { - .statetrans = IB_QPST_RTS2RTS, - .opt_attr = { - [QPT_RC] = (IB_QP_PATH_MIG_STATE | - IB_QP_ALT_PATH | - IB_QP_MIN_RNR_TIMER | - IB_QP_ACCESS_FLAGS), - [QPT_UC] = (IB_QP_PATH_MIG_STATE | - IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_CUR_STATE), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_CUR_STATE) - } - }, - [IB_QPST_SQD2RTS] = { - .statetrans = IB_QPST_SQD2RTS, - .req_attr = { - [QPT_RC] = (IB_QP_STATE), - [QPT_UC] = (IB_QP_STATE), - [QPT_UD] = (IB_QP_STATE), - [QPT_SQP] = (IB_QP_STATE) - }, - .opt_attr = { - [QPT_RC] = 
(IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE | - IB_QP_CUR_STATE), - [QPT_UC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PATH_MIG_STATE | - IB_QP_CUR_STATE), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_CUR_STATE), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_CUR_STATE) - } - }, - [IB_QPST_SQE2RTS] = { - .statetrans = IB_QPST_SQE2RTS, - .req_attr = { - [QPT_RC] = (IB_QP_STATE), - [QPT_UC] = (IB_QP_STATE), - [QPT_UD] = (IB_QP_STATE), - [QPT_SQP] = (IB_QP_STATE) - }, - .opt_attr = { - [QPT_UC] = (IB_QP_CUR_STATE | - IB_QP_ACCESS_FLAGS), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_CUR_STATE), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_CUR_STATE) - } - }, - [IB_QPST_SQD2SQD] = { - .statetrans = IB_QPST_SQD2SQD, - .opt_attr = { - [QPT_RC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_MIN_RNR_TIMER | - IB_QP_PATH_MIG_STATE | - IB_QP_AV | - IB_QP_MAX_QP_RD_ATOMIC | - IB_QP_MAX_DEST_RD_ATOMIC | - IB_QP_CUR_STATE | - IB_QP_PKEY_INDEX | - IB_QP_TIMEOUT | - IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY), - [QPT_UC] = (IB_QP_ALT_PATH | - IB_QP_ACCESS_FLAGS | - IB_QP_PATH_MIG_STATE | - IB_QP_AV | - IB_QP_CUR_STATE | - IB_QP_PKEY_INDEX), - [QPT_UD] = (IB_QP_QKEY | - IB_QP_PKEY_INDEX), - [QPT_SQP] = (IB_QP_QKEY | - IB_QP_PKEY_INDEX) - } - }, -}; - /** @brief validates qp state transition from/to state * output: req and opt attr masks in statetrans * returns: 0 if transition valid * -EINVAL if not */ -static inline int get_modqp_statetrans(int ib_fromstate, int ib_tostate, - struct ehca_modqp_statetrans *statetrans) +static inline enum ib_qp_statetrans +get_modqp_statetrans(int ib_fromstate, int ib_tostate) { int index = -EINVAL; switch (ib_tostate) { @@ -439,9 +229,6 @@ static inline int get_modqp_statetrans(i return -EINVAL; } - if (index >= 0) - *statetrans = modqp_statetrans_table[index]; - return index; } @@ -1001,10 +788,9 @@ static int internal_modify_qp(struct ib_ int attr_mask, int smi_reset2init) { enum ib_qp_state qp_cur_state = 0, qp_new_state = 0; - int 
req_qp_attr_mask = 0, - opt_qp_attr_mask = 0, cnt = 0, qp_attr_idx = 0, retcode = 0; + int cnt = 0, qp_attr_idx = 0, retcode = 0; - struct ehca_modqp_statetrans qp_state_xsit={.statetrans=0}; + enum ib_qp_statetrans statetrans; struct hcp_modify_qp_control_block *mqpcb = NULL; struct ehca_qp *my_qp = NULL; struct ehca_shca *shca = NULL; @@ -1100,18 +886,10 @@ static int internal_modify_qp(struct ib_ "new qp_state=%x attribute_mask=%x", my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask); - if (attr_mask & IB_QP_STATE) { - if (attr->qp_state < IB_QPS_RESET || - attr->qp_state > IB_QPS_ERR) { - retcode = -EINVAL; - EDEB_ERR(4, "Invalid new qp state attr->qp_state=%x " - "ehca_qp=%p qp_num=%x", - attr->qp_state, my_qp, ibqp->qp_num); - goto modify_qp_exit1; - } - qp_new_state = attr->qp_state; - } else - qp_new_state = qp_cur_state; + qp_new_state = attr_mask & IB_QP_CUR_STATE ? attr->qp_state : qp_cur_state; + + if (!ib_modify_qp_is_ok(qp_cur_state, qp_new_state, ibqp->qp_type, attr_mask)) + return -EINVAL; if ((mqpcb->qp_state = ib2ehca_qp_state(qp_new_state))) { update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); @@ -1124,14 +902,14 @@ static int internal_modify_qp(struct ib_ } /* retrieve state transition struct to get req and opt attrs */ - if (get_modqp_statetrans(qp_cur_state, - qp_new_state, &qp_state_xsit) < 0) { + statetrans = get_modqp_statetrans(qp_cur_state, qp_new_state); + if (statetrans < 0) { retcode = -EINVAL; EDEB_ERR(4, " qp_cur_state=%x " "new_qp_state=%x State_xsition=%x " "ehca_qp=%p qp_num=%x", qp_cur_state, qp_new_state, - qp_state_xsit.statetrans, my_qp, ibqp->qp_num); + statetrans, my_qp, ibqp->qp_num); goto modify_qp_exit1; } @@ -1144,35 +922,14 @@ static int internal_modify_qp(struct ib_ goto modify_qp_exit1; } - req_qp_attr_mask = qp_state_xsit.req_attr[qp_attr_idx]; - opt_qp_attr_mask = qp_state_xsit.opt_attr[qp_attr_idx]; - - if ((attr_mask & req_qp_attr_mask) != req_qp_attr_mask) { - retcode = -EINVAL; - EDEB_ERR(4, " 
ehca_qp=%p qp_num=%x " - "req_mask=%x opt_mask=%x submitted_mask=%x " - "qp_type=%x", - my_qp, ibqp->qp_num, - req_qp_attr_mask, opt_qp_attr_mask, - attr_mask, ibqp->qp_type); - goto modify_qp_exit1; - } else if ((attr_mask & ~(req_qp_attr_mask | opt_qp_attr_mask))) { - EDEB(7, " more attributes " - "specified than allowed!!! req_mask=%x opt_mask=%x " - "submitted_mask=%x", - req_qp_attr_mask, opt_qp_attr_mask, attr_mask); - } - - EDEB(7, "ehca_qp=%p qp_num=%x " - "req_mask=%x opt_mask=%x qp_state_xsit=%x ", - my_qp, ibqp->qp_num, req_qp_attr_mask, - opt_qp_attr_mask, qp_state_xsit.statetrans); + EDEB(7, "ehca_qp=%p qp_num=%x qp_state_xsit=%x ", + my_qp, ibqp->qp_num, statetrans); /* sqe -> rts: set purge bit of bad wqe before actual trans */ if ((my_qp->ehca_qp_core.qp_type == IB_QPT_UD || my_qp->ehca_qp_core.qp_type == IB_QPT_GSI || my_qp->ehca_qp_core.qp_type == IB_QPT_SMI) - && qp_state_xsit.statetrans == IB_QPST_SQE2RTS) { + && statetrans == IB_QPST_SQE2RTS) { /* mark next free wqe if kernel */ if (my_qp->uspace_squeue == 0) { struct ehca_wqe *wqe = NULL; @@ -1198,13 +955,13 @@ static int internal_modify_qp(struct ib_ /* enable RDMA_Atomic_Control if reset->init und reliable con this is necessary since gen2 does not provide that flag, but pHyp requires it */ - if (qp_state_xsit.statetrans == IB_QPST_RESET2INIT && + if (statetrans == IB_QPST_RESET2INIT && (ibqp->qp_type == IB_QPT_RC || ibqp->qp_type == IB_QPT_UC)) { mqpcb->rdma_atomic_ctrl = 3; update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RDMA_ATOMIC_CTRL, 1); } /* circ. 
pHyp requires #RDMA/Atomic Responder Resources for UC INIT -> RTR */ - if (qp_state_xsit.statetrans == IB_QPST_INIT2RTR && + if (statetrans == IB_QPST_INIT2RTR && (ibqp->qp_type == IB_QPT_UC) && !(attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)) { mqpcb->rdma_nr_atomic_resp_res = 1; /* default to 1 */ @@ -1439,15 +1196,15 @@ static int internal_modify_qp(struct ib_ if ((my_qp->ehca_qp_core.qp_type == IB_QPT_UD || my_qp->ehca_qp_core.qp_type == IB_QPT_GSI || my_qp->ehca_qp_core.qp_type == IB_QPT_SMI) - && qp_state_xsit.statetrans == IB_QPST_SQE2RTS) { + && statetrans == IB_QPST_SQE2RTS) { /* doorbell to reprocessing wqes */ iosync(); /* serialize GAL register access */ hipz_update_SQA(&my_qp->ehca_qp_core, bad_wqe_cnt-1); EDEB(6, "doorbell for %x wqes", bad_wqe_cnt); } - if (qp_state_xsit.statetrans == IB_QPST_RESET2INIT || - qp_state_xsit.statetrans == IB_QPST_INIT2INIT) { + if (statetrans == IB_QPST_RESET2INIT || + statetrans == IB_QPST_INIT2INIT) { mqpcb->qp_enable = TRUE; mqpcb->qp_state = EHCA_QPS_INIT; update_mask = 0; @@ -1475,7 +1232,7 @@ static int internal_modify_qp(struct ib_ } } - if (qp_state_xsit.statetrans == IB_QPST_ANY2RESET) { + if (statetrans == IB_QPST_ANY2RESET) { ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_rqueue); ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_squeue); } From rolandd at cisco.com Fri Feb 10 16:51:57 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 16:51:57 -0800 Subject: [openib-general] [PATCH 0/4] [RFC] Consolidate modify_qp checks Message-ID: <20062101651.zETYyw9qncSeTl9W@cisco.com> Here is a series of patches that adds a new function ib_modify_qp_is_ok(), which low-level drivers can use to replace boilerplate logic for validating the parameters to the modify_qp method. 
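As a rough illustration of the kind of table-driven check being consolidated here (a userspace sketch, not the code from the patch series): the state, type and attribute names below (qp_state, ATTR_PORT, modify_is_ok and so on) are hypothetical stand-ins trimmed to three states and two QP types, and rejecting attributes outside the required and optional sets is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down model of QP states, types and attribute bits. */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_MAX };
enum qp_type  { QPT_RC, QPT_UD, QPT_MAX };
enum {
	ATTR_STATE      = 1 << 0,
	ATTR_PKEY_INDEX = 1 << 1,
	ATTR_PORT       = 1 << 2,
	ATTR_QKEY       = 1 << 3,
	ATTR_ACCESS     = 1 << 4,
};

/* One entry per (current state, next state) pair. */
struct transition {
	bool valid;
	int  req[QPT_MAX];	/* attributes that must be present */
	int  opt[QPT_MAX];	/* attributes that may be present  */
};

static const struct transition table[QPS_MAX][QPS_MAX] = {
	[QPS_RESET][QPS_RESET] = { .valid = true },
	[QPS_INIT][QPS_RESET]  = { .valid = true },
	[QPS_RESET][QPS_INIT]  = {
		.valid = true,
		.req = {
			[QPT_RC] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_ACCESS,
			[QPT_UD] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_QKEY,
		},
	},
	[QPS_INIT][QPS_INIT]   = {
		.valid = true,
		.opt = {
			[QPT_RC] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_ACCESS,
			[QPT_UD] = ATTR_PKEY_INDEX | ATTR_PORT | ATTR_QKEY,
		},
	},
	/* All other pairs stay zero-initialized, i.e. invalid. */
};

/* Returns true if the transition exists and the attribute mask fits it. */
static bool modify_is_ok(enum qp_state cur, enum qp_state next,
			 enum qp_type type, int mask)
{
	const struct transition *t;
	int req, opt;

	assert(cur < QPS_MAX && next < QPS_MAX && type < QPT_MAX);

	t = &table[cur][next];
	if (!t->valid)
		return false;

	req = t->req[type];
	opt = t->opt[type];

	if ((mask & req) != req)			/* missing a required attribute */
		return false;
	if (mask & ~(req | opt | ATTR_STATE))	/* attribute not allowed here */
		return false;

	return true;
}
```

A driver-side caller in this model would make a single modify_is_ok() call before touching hardware, instead of carrying its own copy of the table.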
In addition to getting rid of duplicated bugs, this ends up saving quite a lot of duplicated code across mthca, ipath and ehca: core/verbs.c | 260 ++++++++++++++++++++++++++++++++++++++++++++ hw/ehca/ehca_qp.c | 283 +++--------------------------------------------- hw/ipath/ipath_verbs.c | 252 +----------------------------------------- hw/mthca/mthca_cmd.c | 98 ++++++++++------ hw/mthca/mthca_cmd.h | 5 hw/mthca/mthca_qp.c | 263 ++------------------------------------------ include/rdma/ib_verbs.h | 18 +++ 7 files changed, 384 insertions(+), 795 deletions(-) I made this a library function rather than putting the logic directly into the ib_modify_qp() to give low-level drivers more flexibility in their implementation, and also to simplify things for things like iWARP drivers, where modify_qp will be somewhat different. I'll commit the core and mthca pieces soon if no one objects. Once that happens, I hope the PathScale and IBM people can check what I did to ipath and ehca and commit the changes there as well. From rjwalsh at pathscale.com Fri Feb 10 16:57:18 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Fri, 10 Feb 2006 16:57:18 -0800 Subject: [openib-general] [PATCH 0/4] [RFC] Consolidate modify_qp checks In-Reply-To: <20062101651.zETYyw9qncSeTl9W@cisco.com> References: <20062101651.zETYyw9qncSeTl9W@cisco.com> Message-ID: <1139619438.25459.0.camel@hematite.internal.keyresearch.com> On Fri, 2006-02-10 at 16:51 -0800, Roland Dreier wrote: > Here is a series of patches that adds a new function ib_modify_qp_is_ok(), > which low-level drivers can use to replace boilerplate logic for > validating the parameters to the modify_qp method. 
> > In addition to getting rid of duplicated bugs, this ends up saving > quite a lot of duplicated code across mthca, ipath and ehca: > > core/verbs.c | 260 ++++++++++++++++++++++++++++++++++++++++++++ > hw/ehca/ehca_qp.c | 283 +++--------------------------------------------- > hw/ipath/ipath_verbs.c | 252 +----------------------------------------- > hw/mthca/mthca_cmd.c | 98 ++++++++++------ > hw/mthca/mthca_cmd.h | 5 > hw/mthca/mthca_qp.c | 263 ++------------------------------------------ > include/rdma/ib_verbs.h | 18 +++ > 7 files changed, 384 insertions(+), 795 deletions(-) > > I made this a library function rather than putting the logic directly > into the ib_modify_qp() to give low-level drivers more flexibility in > their implementation, and also to simplify things for things like > iWARP drivers, where modify_qp will be somewhat different. > > I'll commit the core and mthca pieces soon if no one objects. Once > that happens, I hope the PathScale and IBM people can check what I did > to ipath and ehca and commit the changes there as well. Hi Roland, I'll look it over this weekend or early next week and commit it then. Thanks for the patch! Regards, Robert. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Fri Feb 10 17:15:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Feb 2006 17:15:55 -0800 Subject: [openib-general] [PATCH 0/4] [RFC] Consolidate modify_qp checks In-Reply-To: <1139619438.25459.0.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Fri, 10 Feb 2006 16:57:18 -0800") References: <20062101651.zETYyw9qncSeTl9W@cisco.com> <1139619438.25459.0.camel@hematite.internal.keyresearch.com> Message-ID: Robert> I'll look it over this weekend or early next week and Robert> commit it then. Thanks for the patch! Great. 
I'm going to wait until Monday afternoon or so for feedback on the core part, so no hurry. - R.
References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> Message-ID: <1139661571.4475.4965.camel@hal.voltaire.com> On Thu, 2006-02-09 at 19:02, Sean Hefty wrote: > Does anything outside of userspace send a multi-segment RMPP MAD? Is it likely > that a kernel component would need to ? The only case I can think of here is a SA GetMultiPath for any kernel components as GetMulti is two sided but the kernel wouldn't be sending multisegments here (other than ACKs which it does in other cases too). -- Hal From halr at voltaire.com Sat Feb 11 06:03:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Feb 2006 09:03:55 -0500 Subject: [openib-general] Re: FW: [PATCH 1 of 3] mad: large RMPP support In-Reply-To: <1139661571.4475.4965.camel@hal.voltaire.com> References: <43EB7E28.3050402@ichips.intel.com> <20060209234654.GB5447@mellanox.co.il> <43EBD82B.3090804@ichips.intel.com> <1139661571.4475.4965.camel@hal.voltaire.com> Message-ID: <1139666635.4475.5381.camel@hal.voltaire.com> On Sat, 2006-02-11 at 07:39, Hal Rosenstock wrote: > On Thu, 2006-02-09 at 19:02, Sean Hefty wrote: > > Does anything outside of userspace send a multi-segment RMPP MAD? Is it likely > > that a kernel component would need to ? 
> > The only case I can think of here is a SA GetMultiPath for any kernel > components as GetMulti is two sided but the kernel wouldn't be sending > multisegments here (other than ACKs which it does in other cases too). I amend this answer. The SA GetMulti request could possibly be multisegment. Although this is unlikely, I'm not sure it should be precluded. -- Hal > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sat Feb 11 08:39:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Feb 2006 11:39:35 -0500 Subject: [openib-general] [RFC] [PATCH] mad.c: Add support for switch SMI Message-ID: <1139675973.4475.6062.camel@hal.voltaire.com> mad.c: Add support for switch SMI This is a first cut of adding in switch SMI support. I have tested this with HCAs to make sure I didn't break anything but I have no way of testing this with a switch. The biggest unknown to me is whether the SMI itself (smi.c) is correct for switches. It was written from the IB spec. In order to support this on a switch, there are 2 things that are needed from the driver: 1. On the receive side, the physical port number that a DR SMP was received on must be filled into the ib_wc. 2. On the send side, the driver must support the optional query_ah verb in order to obtain the send side port number (actual switch external port on which to send the DR SMP). I'm interested in feedback on the latter point (is there a better way ?) and any testing experience/feedback with this. 
Signed-off-by: Hal Rosenstock Index: mad.c =================================================================== --- mad.c (revision 5369) +++ mad.c (working copy) @@ -661,9 +661,27 @@ static int handle_outgoing_dr_smp(struct struct ib_mad_port_private *port_priv; struct ib_mad_agent_private *recv_mad_agent = NULL; struct ib_device *device = mad_agent_priv->agent.device; - u8 port_num = mad_agent_priv->agent.port_num; + u8 port_num; struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; + struct ib_ah_attr ah_attr; + + if (device->node_type != RDMA_NODE_IB_SWITCH) + port_num = mad_agent_priv->agent.port_num; + else { + /* For a switch, port number is obtained from AH attribute */ + if (!device->query_ah) { + ret = -ENOTSUPP; + printk(KERN_ERR PFX "Query AH not supported\n"); + goto out; + } + ret = ib_query_ah(mad_send_wr->send_wr.wr.ud.ah, &ah_attr); + if (ret) { + printk(KERN_ERR PFX "Query AH failed\n"); + goto out; + } + port_num = ah_attr.port_num; + } /* * Directed route handling starts if the initial LID routed part of @@ -1631,6 +1649,7 @@ static void ib_mad_recv_done_handler(str struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; + int port_num; response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); if (!response) @@ -1666,16 +1685,20 @@ static void ib_mad_recv_done_handler(str if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (port_priv->device->node_type != RDMA_NODE_IB_SWITCH) + port_num = port_priv->port_num; + else + port_num = wc->port_num; if (!smi_handle_dr_smp_recv(&recv->mad.smp, port_priv->device->node_type, - port_priv->port_num, + port_num, port_priv->device->phys_port_cnt)) goto out; if (!smi_check_forward_dr_smp(&recv->mad.smp)) goto local; if (!smi_handle_dr_smp_send(&recv->mad.smp, port_priv->device->node_type, - port_priv->port_num)) + port_num)) goto out; if (!smi_check_local_smp(&recv->mad.smp, port_priv->device)) goto out; 
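The port selection in the patch above can be modelled outside the kernel roughly as follows. This is only a sketch under assumptions: fake_device, fake_ah and the helper names are invented for illustration, and the standard ENOTSUP stands in for the kernel-internal ENOTSUPP used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's node type, AH and device structs. */
enum node_type { NODE_CA, NODE_SWITCH };

struct fake_ah {
	uint8_t port_num;
};

struct fake_device {
	enum node_type node_type;
	/* Optional verb: NULL when the driver does not implement it. */
	int (*query_ah)(const struct fake_ah *ah, uint8_t *port_num);
};

/* Example query_ah implementation that simply reports the AH's port. */
static int fake_query_ah(const struct fake_ah *ah, uint8_t *port_num)
{
	assert(ah != NULL && port_num != NULL);
	*port_num = ah->port_num;
	return 0;
}

/*
 * Receive side: on a switch the port comes from the work completion,
 * otherwise from the MAD agent's own port.
 */
static uint8_t dr_smp_recv_port(const struct fake_device *dev,
				uint8_t agent_port, uint8_t wc_port)
{
	return dev->node_type == NODE_SWITCH ? wc_port : agent_port;
}

/*
 * Send side: on a switch the port must be looked up through query_ah;
 * if the driver lacks that verb the send cannot be routed.
 */
static int dr_smp_send_port(const struct fake_device *dev,
			    const struct fake_ah *ah,
			    uint8_t agent_port, uint8_t *port_out)
{
	if (dev->node_type != NODE_SWITCH) {
		*port_out = agent_port;
		return 0;
	}
	if (!dev->query_ah)
		return -ENOTSUP;
	return dev->query_ah(ah, port_out);
}
```

The model makes the dependency explicit: a switch driver without query_ah can still receive directed-route SMPs but cannot pick an outgoing port for them.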
From rolandd at cisco.com Sat Feb 11 12:22:21 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 11 Feb 2006 20:22:21 +0000 Subject: [openib-general] [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped Message-ID: <1139689341370-68b63fa9b8e76d91@cisco.com> Fix the following race scenario: - Device is up. - Port event or set mcast list triggers ipoib_mcast_stop_thread, this cancels the query and waits on mcast "done" completion. - Completion is called and "done" is set. - Meanwhile, ipoib_mcast_send arrives and starts a new query, re-initializing "done". Fix this by adding a "multicast started" bit and checking it before starting a send-only join. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+), 0 deletions(-) 479a079663bd4c5f3d2714643b1b8c406aaba3e0 diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e0a5412..2f85a9a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -78,6 +78,7 @@ enum { IPOIB_FLAG_SUBINTERFACE = 4, IPOIB_MCAST_RUN = 5, IPOIB_STOP_REAPER = 6, + IPOIB_MCAST_STARTED = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ccaa0c3..1c71482 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -601,6 +601,10 @@ int ipoib_mcast_start_thread(struct net_ queue_work(ipoib_workqueue, &priv->mcast_task); mutex_unlock(&mcast_mutex); + spin_lock_irq(&priv->lock); + set_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + return 0; } @@ -611,6 +615,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + spin_lock_irq(&priv->lock); + 
clear_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + mutex_lock(&mcast_mutex); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); @@ -693,6 +701,12 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + goto unlock; + } + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -754,6 +768,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } -- 1.1.3 From rolandd at cisco.com Sat Feb 11 12:22:21 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 11 Feb 2006 20:22:21 +0000 Subject: [openib-general] [git patch review 3/4] IB/mthca: Don't print debugging info until we have all values In-Reply-To: <1139689341370-af4160238007a6e3@cisco.com> Message-ID: <1139689341370-48a55ba994088cbc@cisco.com> When debugging is enabled, the mthca_QUERY_DEV_LIM() firmware command function prints out some of the device limits that it queries. However the debugging prints happen before all of the fields are extracted from the firmware response, so some of the values that get printed are uninitialized junk. Move the prints to the end of the function to fix this. 
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 38 ++++++++++++++++--------------- 1 files changed, 19 insertions(+), 19 deletions(-) f295c79b6766b25fe8c1aad88211c54d1caa7e0b diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index f9b9b93..2825615 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1029,25 +1029,6 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); dev_lim->uar_scratch_entry_sz = size; - mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", - dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); - mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", - dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz); - mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", - dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); - mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", - dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); - mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", - dev_lim->reserved_mrws, dev_lim->reserved_mtts); - mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", - dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); - mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", - dev_lim->max_pds, dev_lim->reserved_mgms); - mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", - dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz); - - mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); - if (mthca_is_memfree(dev)) { MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); dev_lim->max_srq_sz = 1 << field; @@ -1093,6 +1074,25 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE; } + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, 
dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", + dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", + dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + out: mthca_free_mailbox(dev, mailbox); return err; -- 1.1.3 From rolandd at cisco.com Sat Feb 11 12:22:21 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 11 Feb 2006 20:22:21 +0000 Subject: [openib-general] [git patch review 4/4] IPoIB: Yet another fix for send-only joins In-Reply-To: <1139689341370-48a55ba994088cbc@cisco.com> Message-ID: <1139689341371-7b843cf913438d83@cisco.com> Even after the last fix, it's still possible for a send-only join to start before the join for the broadcast group has finished. This could cause us to create a multicast group using attributes from the broadcast group that haven't been initialized yet, so we would use garbage for the Q_Key, etc. Fix this by waiting until the broadcast group's attached flag is set before starting send-only joins. 
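A minimal sketch of the resulting guard in plain C, assuming (as in the earlier patches of this series) that the started flag and a non-NULL broadcast pointer are also required. The struct and flag names are made-up stand-ins for the ipoib ones, packets are dropped rather than queued while the guard fails, and the real code performs this test while holding priv->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the ipoib flag numbers. */
enum { FAKE_MCAST_STARTED = 1 << 0 };		/* bit in fake_priv.flags  */
enum { FAKE_MCAST_ATTACHED = 1 << 0 };	/* bit in fake_mcast.flags */

struct fake_mcast {
	unsigned long flags;
};

struct fake_priv {
	unsigned long      flags;
	struct fake_mcast *broadcast;	/* NULL until the broadcast group exists */
	unsigned long      tx_dropped;
};

/*
 * Returns true only when a send-only join may be started:
 *   1. the multicast thread has been started,
 *   2. the broadcast group has been allocated, and
 *   3. the broadcast group is attached, so its Q_Key etc. are valid.
 * Otherwise the packet is counted as dropped.
 */
static bool may_send_only_join(struct fake_priv *priv)
{
	assert(priv != NULL);

	if (!(priv->flags & FAKE_MCAST_STARTED) ||
	    !priv->broadcast ||
	    !(priv->broadcast->flags & FAKE_MCAST_ATTACHED)) {
		++priv->tx_dropped;
		return false;
	}
	return true;
}
```

The order of the tests matters only for the NULL check: the attached flag can be read only after priv->broadcast is known to be non-NULL.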
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 15 ++++++++++----- 1 files changed, 10 insertions(+), 5 deletions(-) 20b83382d1c5d4d1a73fc5671261db5239d1dbb3 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 932bf13..a2408d7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr } if (!priv->broadcast) { - priv->broadcast = ipoib_mcast_alloc(dev, 1); - if (!priv->broadcast) { + struct ipoib_mcast *broadcast; + + broadcast = ipoib_mcast_alloc(dev, 1); + if (!broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) @@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr return; } - memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + spin_lock_irq(&priv->lock); + memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); + priv->broadcast = broadcast; - spin_lock_irq(&priv->lock); __ipoib_mcast_add(dev, priv->broadcast); spin_unlock_irq(&priv->lock); } @@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); - if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) { + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || + !priv->broadcast || + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto unlock; -- 1.1.3 From rolandd at cisco.com Sat Feb 11 12:22:21 2006 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 11 Feb 2006 20:22:21 +0000 Subject: [openib-general] [git patch review 2/4] IPoIB: Fix another send-only join race In-Reply-To: <1139689341370-68b63fa9b8e76d91@cisco.com> Message-ID: <1139689341370-af4160238007a6e3@cisco.com> Further, there's an additional issue that I saw in testing: 
ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if the device was downed and then upped internally because of a port event). If this happens and the send-only join request gets completed before priv->broadcast is set, we get an oops. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) 7bcb974ef6a0ae903888272c92c66ea779388c01 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 1c71482..932bf13 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -701,7 +701,7 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); - if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)) { + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); goto unlock; -- 1.1.3 From akpm at osdl.org Sat Feb 11 14:02:09 2006 From: akpm at osdl.org (Andrew Morton) Date: Sat, 11 Feb 2006 14:02:09 -0800 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: <1139689341370-68b63fa9b8e76d91@cisco.com> References: <1139689341370-68b63fa9b8e76d91@cisco.com> Message-ID: <20060211140209.57af1b16.akpm@osdl.org> Roland Dreier wrote: > > + spin_lock_irq(&priv->lock); > + set_bit(IPOIB_MCAST_STARTED, &priv->flags); > + spin_unlock_irq(&priv->lock); Strange to put a lock around an atomic op like that. Sometimes it's valid. If another cpu was doing: spin_lock(lock); if (test_bit(IPOIB_MCAST_STARTED)) something(); ... if (test_bit(IPOIB_MCAST_STARTED)) something_else(); spin_unlock(lock); then the locked set_bit() makes sense. 
But often it doesn't ;) From rdreier at cisco.com Sat Feb 11 18:03:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 11 Feb 2006 18:03:18 -0800 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: <20060211140209.57af1b16.akpm@osdl.org> (Andrew Morton's message of "Sat, 11 Feb 2006 14:02:09 -0800") References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> Message-ID: > Roland Dreier wrote: > > > > + spin_lock_irq(&priv->lock); > > + set_bit(IPOIB_MCAST_STARTED, &priv->flags); > > + spin_unlock_irq(&priv->lock); > > Strange to put a lock around an atomic op like that. > > Sometimes it's valid. If another cpu was doing: > > spin_lock(lock); > > if (test_bit(IPOIB_MCAST_STARTED)) > something(); > ... > if (test_bit(IPOIB_MCAST_STARTED)) > something_else(); > > spin_unlock(lock); > > then the locked set_bit() makes sense. > > But often it doesn't ;) Good point. Michael, any reason why the lock is there around the set_bit()? (And similarly for the corresponding clear_bit()) Thanks, Roland From mst at mellanox.co.il Sat Feb 11 23:50:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Feb 2006 09:50:37 +0200 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> Message-ID: <20060212075037.GA11550@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped > > > Roland Dreier wrote: > > > > > > + spin_lock_irq(&priv->lock); > > > + set_bit(IPOIB_MCAST_STARTED, &priv->flags); > > > + spin_unlock_irq(&priv->lock); > > > > Strange to put a lock around an atomic op like that. > > > > Sometimes it's valid. 
If another cpu was doing: > > > > spin_lock(lock); > > > > if (test_bit(IPOIB_MCAST_STARTED)) > > something(); > > ... > > if (test_bit(IPOIB_MCAST_STARTED)) > > something_else(); > > > > spin_unlock(lock); > > > > then the locked set_bit() makes sense. > > > > But often it doesn't ;) > > Good point. Michael, any reason why the lock is there around the > set_bit()? (And similarly for the corresponding clear_bit()) > > Thanks, > Roland Basically, it's as Andrew said: the lock around clear_bit is there to ensure that ipoib_mcast_send isn't running already when we stop the thread. That's why test_bit has to be inside the lock, too. This was discussed with Krishna Kumar when I posted the patch originally. For more detail, please review this thread: http://www.mail-archive.com/openib-general at openib.org/msg13206.html or here http://openib.org/pipermail/openib-general/2005-December/014370.html -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Sat Feb 11 23:52:42 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 12 Feb 2006 09:52:42 +0200 Subject: [openib-general] RE: [PATCH] Opensm - clean osm_vendor_mlx_sa.c code Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9A@mtlexch01.mtl.com> Hi Hal, In answer to your questions: 1. This is still one code base for gen1 too. 2. I don't think it is necessary to add osm_arbitrary_context_t in all vendors, just in the ones using it. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, February 09, 2006 9:09 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - clean osm_vendor_mlx_sa.c code Hi Yael, On Mon, 2006-02-06 at 07:39, Yael Kalka wrote: > Hi Hal, > > Currently in osm_vendor_mlx_sa.c the sent context is saved arbitrarily > as nodeInfo_context. This results in need for strange castings from > long to pointer and vice-versa. 
The following patch adds another > possible context - arbitrary context, which will be used in this case. Thanks. Applied with one question below. BTW, I have no way to test this (other than that things still work for OpenIB). Is this still one code base for gen1 too ? -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_mlx_sa.c > =================================================================== > --- libvendor/osm_vendor_mlx_sa.c (revision 5307) > +++ libvendor/osm_vendor_mlx_sa.c (working copy) > @@ -96,9 +96,9 @@ __osmv_sa_mad_rcv_cb( > goto Exit; > } > > - /* obtain the sent context since we store it during send in the ni_ctx */ > + /* obtain the sent context */ > p_query_req_copy = > - (osmv_query_req_t *)CAST_P2LONG(p_req_madw->context.ni_context.node_guid); > + (osmv_query_req_t *)(p_req_madw->context.arb_context.context1); > > /* provide the context of the original request in the result */ > query_res.query_context = p_query_req_copy->query_context; > @@ -207,7 +207,7 @@ __osmv_sa_mad_err_cb( > > /* Obtain the sent context etc */ > p_query_req_copy = > - (osmv_query_req_t *)CAST_P2LONG(p_madw->context.ni_context.node_guid); > + (osmv_query_req_t *)(p_madw->context.arb_context.context1); > > /* provide the context of the original request in the result */ > query_res.query_context = p_query_req_copy->query_context; > @@ -561,10 +561,17 @@ __osmv_send_sa_req( > /* > Provide the address to send to > */ > + /* Patch to handle IBAL - host order , where it should take destination lid in network order */ > +#ifdef OSM_VENDOR_INTF_AL > + p_madw->mad_addr.dest_lid = p_bind->sm_lid; > +#else > p_madw->mad_addr.dest_lid = cl_hton16(p_bind->sm_lid); > +#endif > p_madw->mad_addr.addr_type.smi.source_lid = > cl_hton16(p_bind->lid); > p_madw->mad_addr.addr_type.gsi.remote_qp = CL_HTON32(1); > + p_madw->mad_addr.addr_type.gsi.remote_qkey = IB_QP1_WELL_KNOWN_Q_KEY; > + p_madw->mad_addr.addr_type.gsi.pkey = IB_DEFAULT_PKEY; > 
p_madw->resp_expected = TRUE; > p_madw->fail_msg = CL_DISP_MSGID_NONE; > > @@ -574,12 +581,11 @@ __osmv_send_sa_req( > Since we can not rely on the client to keep it arroud until > the response - we duplicate it and will later dispose it (in CB). > To store on the MADW we cast it into what opensm has: > - p_madw->context.ni_context.node_guid > + p_madw->context.arb_context.context1 > */ > p_query_req_copy = cl_malloc(sizeof(*p_query_req_copy)); > *p_query_req_copy = *p_query_req; > - p_madw->context.ni_context.node_guid = > - (ib_net64_t)CAST_P2LONG(p_query_req_copy); > + p_madw->context.arb_context.context1 = p_query_req_copy; > > /* we can support async as well as sync calls */ > sync = ((p_query_req->flags & OSM_SA_FLAGS_SYNC) == OSM_SA_FLAGS_SYNC); > Index: include/opensm/osm_madw.h > =================================================================== > --- include/opensm/osm_madw.h (revision 5307) > +++ include/opensm/osm_madw.h (working copy) > @@ -315,6 +315,22 @@ typedef struct _osm_vla_context > boolean_t set_method; > } osm_vla_context_t; > /*********/ > +/****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t > +* NAME > +* osm_sa_context_t > +* > +* DESCRIPTION > +* Context needed by arbitrary recipient. > +* > +* SYNOPSIS > +*/ > +typedef struct _osm_arbitrary_context > +{ > + void* context1; > + void* context2; > +} osm_arbitrary_context_t; > +/*********/ > + > /****s* OpenSM: MAD Wrapper/osm_madw_context_t > * NAME > * osm_madw_context_t > @@ -335,6 +351,7 @@ typedef union _osm_madw_context > osm_smi_context_t smi_context; > osm_slvl_context_t slvl_context; > osm_pkey_context_t pkey_context; > + osm_arbitrary_context_t arb_context; Should this be carried for for all vendor layers or only the ones which need this ? > } osm_madw_context_t; > /*********/ > > @@ -880,6 +897,34 @@ osm_madw_get_vla_context_ptr( > } > /* > * PARAMETERS > +* p_madw > +* [in] Pointer to an osm_madw_t object. 
> +* > +* RETURN VALUES > +* Pointer to the start of the context structure. > +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + > +/****f* OpenSM: MAD Wrapper/osm_madw_get_arbitrary_context_ptr > +* NAME > +* osm_madw_get_arbitrary_context_ptr > +* > +* DESCRIPTION > +* Gets a pointer to the arbitrary context in this MAD. > +* > +* SYNOPSIS > +*/ > +static inline osm_arbitrary_context_t* > +osm_madw_get_arbitrary_context_ptr( > + IN const osm_madw_t* const p_madw ) > +{ > + return( (osm_arbitrary_context_t*)&p_madw->context ); > +} > +/* > +* PARAMETERS > * p_madw > * [in] Pointer to an osm_madw_t object. > * > From yael at mellanox.co.il Sat Feb 11 23:56:05 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 12 Feb 2006 09:56:05 +0200 Subject: [openib-general] RE: [PATCH] Opensm - cl_event_wheel casting Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9B@mtlexch01.mtl.com> Hi Hal, I am not sure it really matters, as the timeout used will be of uint32_t size anyways. If you think using the max 32 bit makes more sense - I am fine with that too. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, February 09, 2006 5:29 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - cl_event_wheel casting Hi Yael, On Mon, 2006-02-06 at 03:53, Yael Kalka wrote: > Hi Hal, > > The following patch adds the casting done in a clearer way - to avoid > compilation errors in windows. Also - added a clear message if the > timeout was trimmed (due to the casting). > > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: complib/cl_event_wheel.c > =================================================================== > --- complib/cl_event_wheel.c (revision 5307) > +++ complib/cl_event_wheel.c (working copy) > @@ -426,8 +426,18 @@ cl_event_wheel_reg( > * cl_timer_stop(&p_event_wheel->timer); > */ > > + /* The timeout for the cl_timer_start should be given as uint32_t. 
> + if there is an overflow - warn about it. */ > + if ( timeout > (uint32_t)timeout ) > + { > + osm_log (p_event_wheel->p_log, OSM_LOG_INFO, > + "cl_event_wheel_reg: " > + "timeout requested is too large. Using timeout: %u \n", > + (uint32_t)timeout ); > + } > + > /* start the timer to the timeout [msec] */ > - cl_status = cl_timer_start(&p_event_wheel->timer, timeout); > + cl_status = cl_timer_start(&p_event_wheel->timer, (uint32_t)timeout); Shouldn't this use the max 32 bit timeout here rather than the low 32 bits ? -- Hal > if (cl_status != CL_SUCCESS) > { > From mrmacman_g4 at mac.com Sun Feb 12 00:13:02 2006 From: mrmacman_g4 at mac.com (Kyle Moffett) Date: Sun, 12 Feb 2006 03:13:02 -0500 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: <20060212075037.GA11550@mellanox.co.il> References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> <20060212075037.GA11550@mellanox.co.il> Message-ID: On Feb 12, 2006, at 02:50, Michael S. Tsirkin wrote: > Basically, its as Andrew said: the lock around clear_bit is there > to ensure that ipoib_mcast_send isnt running already when we stop > the thread. Thats why test_bit has to be inside the lock, too. Looks like you guys could use nonatomic versions to improve bus efficiency slightly, but they appear to be relying on the fact that when the function calling set_bit() returns, the multicast thread will be guaranteed to be finished and never run again. The set_bit() can only happen when the thread is not doing work (due to the lock), and since the thread firsts checks the bit before doing any work, it provides more guarantees than just the atomics would. Cheers, Kyle Moffett -----BEGIN GEEK CODE BLOCK----- Version: 3.12 GCM/CS/IT/E/U d- s++: a18 C++++>$ ULBX*++++(+++)>$ P++++(+++)>$ L++++ (+++)>$ !E- W+++(++) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+ PGP + t+(+++) 5 X R? 
!tv-(--) b++++(++) DI+(++) D+++ G e>++++$ h*(+)>++$ r %(--) !y?-(--) ------END GEEK CODE BLOCK------ From mst at mellanox.co.il Sun Feb 12 00:19:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Feb 2006 10:19:10 +0200 Subject: [openib-general] Re: Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> <20060212075037.GA11550@mellanox.co.il> Message-ID: <20060212081910.GA11812@mellanox.co.il> Quoting r. Kyle Moffett : > On Feb 12, 2006, at 02:50, Michael S. Tsirkin wrote: > >Basically, its as Andrew said: the lock around clear_bit is there > >to ensure that ipoib_mcast_send isnt running already when we stop > >the thread. Thats why test_bit has to be inside the lock, too. > > Looks like you guys could use nonatomic versions to improve bus > efficiency slightly I think we need atomics since other places touch bits in the same word without taking the lock. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Sun Feb 12 02:17:01 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 12 Feb 2006 12:17:01 +0200 (IST) Subject: [openib-general] iser: 3 changesets Message-ID: ------------------------------------------------------------------------ r5377 | ogerlitz | 2006-02-12 11:56:24 +0200 (Sun, 12 Feb 2006) | 5 lines put all the iscsi_iser.h #define statements before enum and struct declarations, where some low level #defines where moved to the code, cleanups Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5376 | ogerlitz | 2006-02-12 11:48:42 +0200 (Sun, 12 Feb 2006) | 4 lines re-arrange iscsi_iser.c such that functions are in the order of drivers/scsi/iscsi_tcp.c Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5375 | ogerlitz | 2006-02-12 11:43:10 +0200 (Sun, 12 Feb 2006) | 4 lines reformat iscsi_iser.h such that its more aligned with drivers/scsi/iscsi_tcp.c, cleanups Signed-off-by: Or Gerlitz From ogerlitz at voltaire.com Sun Feb 12 02:17:40 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 12 Feb 2006 12:17:40 +0200 (IST) Subject: [openib-general] [PATCH] iser: bugfix in disconnect flow Message-ID: bugfix in disconnect flow, wrong field was checked, remove iscsi_iser_conn->state Signed-off-by: Or Gerlitz Index: iscsi_iser.h
=================================================================== --- iscsi_iser.h (revision 5377) +++ iscsi_iser.h (revision 5378) @@ -314,7 +314,6 @@ struct iscsi_iser_conn { struct iser_conn *ib_conn; /* iSER IB conn */ int ff_mode_enabled; /* To be removed ??? */ - atomic_t state; /* iSCSI connection state */ atomic_t post_recv_buf_count; atomic_t post_send_buf_count; wait_queue_head_t disconnect_wait_q; /* used by sync term */ Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 5377) +++ iser_verbs.c (revision 5378) @@ -821,7 +821,7 @@ void iser_comp_error_worker(void *data) if (p_iser_conn == NULL) iser_bug("NULL p_desc->p_conn \n"); - if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) + if (atomic_read(&p_iser_conn->ib_conn->state) == ISER_CONN_UP) iser_conn_async_terminate(p_iser_conn->ib_conn); iser_complete_conn_termination(p_iser_conn); From yael at mellanox.co.il Sun Feb 12 03:25:30 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 12 Feb 2006 13:25:30 +0200 Subject: [openib-general] [PATCH] Opensm - add includes for PRIx64 Message-ID: <5z3biolsyd.fsf@mtl066.yok.mtl.com> Hi Hal, In gen2 stack, the PRIx64 definitions come either from inttypes.h or from cl_debug_osd.h file (included by cl_debug.h). In windows stack inttypes.h file doesn't exist, thus the PRIx64 definition comes only from the cl_debug_osd.h file. The following patch adds includes to cl_debug.h in files where this include is missing, and there is use of PRIx64. 
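[Editor's sketch, not part of the original message.] The portability point above can be illustrated with a small, hedged example: use inttypes.h where it exists and fall back to a local PRIx64 definition otherwise. HAVE_INTTYPES_H and format_guid are hypothetical names chosen for this sketch, and the fallback strings are assumptions; the actual contents of cl_debug_osd.h are not shown in this thread.

```c
#include <stdint.h>
#include <stdio.h>

/* Use the standard header when the platform ships it.
 * HAVE_INTTYPES_H is a hypothetical configure-style macro (assumption). */
#if defined(HAVE_INTTYPES_H) || !defined(_WIN32)
#include <inttypes.h>
#endif

/* Fallback for toolchains without inttypes.h. The strings below are
 * assumptions for this sketch, not copied from cl_debug_osd.h. */
#ifndef PRIx64
#ifdef _WIN32
#define PRIx64 "I64x"
#else
#define PRIx64 "llx"
#endif
#endif

/* Format a 64-bit GUID as "0x" followed by 16 hex digits.
 * Returns the number of characters snprintf would have written. */
static int format_guid(uint64_t guid, char *buf, size_t len)
{
    return snprintf(buf, len, "0x%016" PRIx64, guid);
}
```

A caller would pass a buffer of at least 19 bytes. Keeping the fallback behind #ifndef means a file that includes both headers does not redefine the macro.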
Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_mlx_sa.c =================================================================== --- libvendor/osm_vendor_mlx_sa.c (revision 5372) +++ libvendor/osm_vendor_mlx_sa.c (working copy) @@ -40,6 +40,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 5372) +++ opensm/osm_subnet.c (working copy) @@ -51,6 +51,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 5372) +++ opensm/osm_inform.c (working copy) @@ -49,6 +49,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_service.c =================================================================== --- opensm/osm_service.c (revision 5372) +++ opensm/osm_service.c (working copy) @@ -49,6 +49,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_mtree.c =================================================================== --- opensm/osm_mtree.c (revision 5372) +++ opensm/osm_mtree.c (working copy) @@ -50,6 +50,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include /********************************************************************** Index: opensm/osm_ucast_updn.c =================================================================== --- opensm/osm_ucast_updn.c (revision 5372) +++ opensm/osm_ucast_updn.c (working copy) @@ -50,6 +50,7 @@ # include #endif /* HAVE_CONFIG_H */ +#include #include #include #include Index: opensm/osm_db_pack.c =================================================================== --- opensm/osm_db_pack.c (revision 5372) +++ opensm/osm_db_pack.c (working copy) @@ -40,6 +40,7 @@ #endif /* HAVE_CONFIG_H */ #include +#include #include static inline void 
__osm_pack_guid(uint64_t guid, char *p_guid_str) From yael at mellanox.co.il Sun Feb 12 03:31:13 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 12 Feb 2006 13:31:13 +0200 Subject: [openib-general] [PATCH] Opensm - opensm/osm_ucast_updn.c Message-ID: <5z1wy8lsou.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch cleans up the osm_ucast_updn.c construct function that you pointed out, and also adds a check if the construct succeeded during updn_init. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_ucast_updn.c =================================================================== --- opensm/osm_ucast_updn.c (revision 5372) +++ opensm/osm_ucast_updn.c (working copy) @@ -495,12 +496,6 @@ updn_construct(void) OSM_LOG_ENTER( &(osm.log) , updn_construct); p_updn = cl_zalloc(sizeof(updn_t)); - if (p_updn == NULL) - { - goto Exit; - } - - Exit : OSM_LOG_EXIT( &(osm.log) ); return(p_updn); } @@ -519,6 +514,12 @@ updn_init( ib_api_status_t status = IB_SUCCESS; OSM_LOG_ENTER( &(osm.log) , updn_init ); + /* Make sure the p_updn isn't NULL */ + if (!p_updn) + { + status = IB_ERROR; + goto Exit_Bad; + } p_updn->state = UPDN_INIT; cl_qmap_init( &p_updn->guid_rank_tbl); p_list = (cl_list_t*)cl_malloc(sizeof(cl_list_t)); From halr at voltaire.com Sun Feb 12 04:41:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 07:41:22 -0500 Subject: [openib-general] RE: [PATCH] Opensm - clean osm_vendor_mlx_sa.c code In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9A@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9A@mtlexch01.mtl.com> Message-ID: <1139748078.4475.6430.camel@hal.voltaire.com> On Sun, 2006-02-12 at 02:52, Yael Kalka wrote: > Hi Hal, > In answer to your questions: > 1. This is still one code base for gen1 too. > 2. I don't think it is necessary to add osm_arbitrary_context_t in all > vendors, just in the ones using it. That looks to me like OSMV_SIM, OSMV_GEN1, and OSMV_VAPI. Can you confirm ? 
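[Editor's sketch, not part of the original message.] The idea behind the arbitrary context under discussion can be shown in isolation: a union member holding plain void pointers lets the sender stash a heap copy of the request without casting it through a 64-bit GUID field. The types and function names below are simplified stand-ins invented for this sketch, not the OpenSM definitions, and the NULL check on allocation is an addition that the quoted patch does not contain.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the OpenSM context types (assumptions). */
typedef struct ni_context { uint64_t node_guid; } ni_context_t;
typedef struct arbitrary_context {
    void *context1;
    void *context2;
} arbitrary_context_t;

typedef union madw_context {
    ni_context_t ni_context;
    arbitrary_context_t arb_context;
} madw_context_t;

/* Hypothetical request type standing in for osmv_query_req_t. */
typedef struct query_req {
    int query_type;
    void *query_context;
} query_req_t;

/* Send path: duplicate the caller's request and keep the copy in the
 * context, so the caller need not keep the original alive. */
static int stash_request(madw_context_t *ctx, const query_req_t *req)
{
    query_req_t *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;
    *copy = *req;
    ctx->arb_context.context1 = copy; /* no integer/pointer cast needed */
    ctx->arb_context.context2 = NULL;
    return 0;
}

/* Callback path: take ownership of the stored copy; the caller frees it. */
static query_req_t *take_request(madw_context_t *ctx)
{
    query_req_t *copy = ctx->arb_context.context1;
    ctx->arb_context.context1 = NULL;
    return copy;
}
```

Because the member lives in a union, adding it costs no space in vendor layers that never use it, provided it is no larger than the biggest existing member.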
-- Hal > Yael > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, February 09, 2006 9:09 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: [PATCH] Opensm - clean osm_vendor_mlx_sa.c code > > > Hi Yael, > > On Mon, 2006-02-06 at 07:39, Yael Kalka wrote: > > Hi Hal, > > > > Currently in osm_vendor_mlx_sa.c the sent context is saved arbitrarily > > as nodeInfo_context. This results in need for strange castings from > > long to pointer and vice-versa. The following patch adds another > > possible context - arbitrary context, which will be used in this case. > > Thanks. Applied with one question below. > > BTW, I have no way to test this (other than that things still work for > OpenIB). Is this still one code base for gen1 too ? > > -- Hal > > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: libvendor/osm_vendor_mlx_sa.c > > =================================================================== > > --- libvendor/osm_vendor_mlx_sa.c (revision 5307) > > +++ libvendor/osm_vendor_mlx_sa.c (working copy) > > @@ -96,9 +96,9 @@ __osmv_sa_mad_rcv_cb( > > goto Exit; > > } > > > > - /* obtain the sent context since we store it during send in the ni_ctx */ > > + /* obtain the sent context */ > > p_query_req_copy = > > - (osmv_query_req_t *)CAST_P2LONG(p_req_madw->context.ni_context.node_guid); > > + (osmv_query_req_t *)(p_req_madw->context.arb_context.context1); > > > > /* provide the context of the original request in the result */ > > query_res.query_context = p_query_req_copy->query_context; > > @@ -207,7 +207,7 @@ __osmv_sa_mad_err_cb( > > > > /* Obtain the sent context etc */ > > p_query_req_copy = > > - (osmv_query_req_t *)CAST_P2LONG(p_madw->context.ni_context.node_guid); > > + (osmv_query_req_t *)(p_madw->context.arb_context.context1); > > > > /* provide the context of the original request in the result */ > > query_res.query_context = p_query_req_copy->query_context; > > 
@@ -561,10 +561,17 @@ __osmv_send_sa_req( > > /* > > Provide the address to send to > > */ > > + /* Patch to handle IBAL - host order , where it should take destination lid in network order */ > > +#ifdef OSM_VENDOR_INTF_AL > > + p_madw->mad_addr.dest_lid = p_bind->sm_lid; > > +#else > > p_madw->mad_addr.dest_lid = cl_hton16(p_bind->sm_lid); > > +#endif > > p_madw->mad_addr.addr_type.smi.source_lid = > > cl_hton16(p_bind->lid); > > p_madw->mad_addr.addr_type.gsi.remote_qp = CL_HTON32(1); > > + p_madw->mad_addr.addr_type.gsi.remote_qkey = IB_QP1_WELL_KNOWN_Q_KEY; > > + p_madw->mad_addr.addr_type.gsi.pkey = IB_DEFAULT_PKEY; > > p_madw->resp_expected = TRUE; > > p_madw->fail_msg = CL_DISP_MSGID_NONE; > > > > @@ -574,12 +581,11 @@ __osmv_send_sa_req( > > Since we can not rely on the client to keep it arroud until > > the response - we duplicate it and will later dispose it (in CB). > > To store on the MADW we cast it into what opensm has: > > - p_madw->context.ni_context.node_guid > > + p_madw->context.arb_context.context1 > > */ > > p_query_req_copy = cl_malloc(sizeof(*p_query_req_copy)); > > *p_query_req_copy = *p_query_req; > > - p_madw->context.ni_context.node_guid = > > - (ib_net64_t)CAST_P2LONG(p_query_req_copy); > > + p_madw->context.arb_context.context1 = p_query_req_copy; > > > > /* we can support async as well as sync calls */ > > sync = ((p_query_req->flags & OSM_SA_FLAGS_SYNC) == OSM_SA_FLAGS_SYNC); > > Index: include/opensm/osm_madw.h > > =================================================================== > > --- include/opensm/osm_madw.h (revision 5307) > > +++ include/opensm/osm_madw.h (working copy) > > @@ -315,6 +315,22 @@ typedef struct _osm_vla_context > > boolean_t set_method; > > } osm_vla_context_t; > > /*********/ > > +/****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t > > +* NAME > > +* osm_sa_context_t > > +* > > +* DESCRIPTION > > +* Context needed by arbitrary recipient. 
> > +* > > +* SYNOPSIS > > +*/ > > +typedef struct _osm_arbitrary_context > > +{ > > + void* context1; > > + void* context2; > > +} osm_arbitrary_context_t; > > +/*********/ > > + > > /****s* OpenSM: MAD Wrapper/osm_madw_context_t > > * NAME > > * osm_madw_context_t > > @@ -335,6 +351,7 @@ typedef union _osm_madw_context > > osm_smi_context_t smi_context; > > osm_slvl_context_t slvl_context; > > osm_pkey_context_t pkey_context; > > + osm_arbitrary_context_t arb_context; > > Should this be carried for for all vendor layers or only the ones which > need this ? > > > } osm_madw_context_t; > > /*********/ > > > > @@ -880,6 +897,34 @@ osm_madw_get_vla_context_ptr( > > } > > /* > > * PARAMETERS > > +* p_madw > > +* [in] Pointer to an osm_madw_t object. > > +* > > +* RETURN VALUES > > +* Pointer to the start of the context structure. > > +* > > +* NOTES > > +* > > +* SEE ALSO > > +*********/ > > + > > +/****f* OpenSM: MAD Wrapper/osm_madw_get_arbitrary_context_ptr > > +* NAME > > +* osm_madw_get_arbitrary_context_ptr > > +* > > +* DESCRIPTION > > +* Gets a pointer to the arbitrary context in this MAD. > > +* > > +* SYNOPSIS > > +*/ > > +static inline osm_arbitrary_context_t* > > +osm_madw_get_arbitrary_context_ptr( > > + IN const osm_madw_t* const p_madw ) > > +{ > > + return( (osm_arbitrary_context_t*)&p_madw->context ); > > +} > > +/* > > +* PARAMETERS > > * p_madw > > * [in] Pointer to an osm_madw_t object. > > * > > From halr at voltaire.com Sun Feb 12 05:05:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 08:05:43 -0500 Subject: [openib-general] RE: [PATCH] Opensm - cl_event_wheel casting In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FD9B@mtlexch01.mtl.com> Message-ID: <1139749542.4475.6497.camel@hal.voltaire.com> Hi Yael, On Sun, 2006-02-12 at 02:56, Yael Kalka wrote: > Hi Hal, > I am not sure it really matters, as the timeout used will be of uint32_t size > anyways. 
Then why check and warn about an overflow in the clause right above it ? A max 32 bit timeout would be pretty long (slightly more than 4294 seconds). > If you think using the max 32 bit makes more sense - I am fine with that too. I will change it for this case which likely wouldn't be hit. -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, February 09, 2006 5:29 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: [PATCH] Opensm - cl_event_wheel casting > > > Hi Yael, > > On Mon, 2006-02-06 at 03:53, Yael Kalka wrote: > > Hi Hal, > > > > The following patch adds the casting done in a clearer way - to avoid > > compilation errors in windows. Also - added a clear message if the > > timeout was trimmed (due to the casting). > > > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: complib/cl_event_wheel.c > > =================================================================== > > --- complib/cl_event_wheel.c (revision 5307) > > +++ complib/cl_event_wheel.c (working copy) > > @@ -426,8 +426,18 @@ cl_event_wheel_reg( > > * cl_timer_stop(&p_event_wheel->timer); > > */ > > > > + /* The timeout for the cl_timer_start should be given as uint32_t. > > + if there is an overflow - warn about it. */ > > + if ( timeout > (uint32_t)timeout ) > > + { > > + osm_log (p_event_wheel->p_log, OSM_LOG_INFO, > > + "cl_event_wheel_reg: " > > + "timeout requested is too large. Using timeout: %u \n", > > + (uint32_t)timeout ); > > + } > > + > > /* start the timer to the timeout [msec] */ > > - cl_status = cl_timer_start(&p_event_wheel->timer, timeout); > > + cl_status = cl_timer_start(&p_event_wheel->timer, (uint32_t)timeout); > > Shouldn't this use the max 32 bit timeout here rather than the low 32 > bits ? 
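[Editor's sketch, not part of the original message.] The difference between the two behaviours being debated, keeping the low 32 bits versus using the maximum 32-bit value, can be shown with two small helpers. The function names are hypothetical, and the sketch makes no assumption about the units of the timeout.

```c
#include <stdint.h>

/* What a plain cast does: keep only the low 32 bits of the value. */
static uint32_t timeout_truncate(uint64_t timeout)
{
    return (uint32_t)timeout;
}

/* The alternative raised in the review: saturate at the largest
 * value a uint32_t can hold instead of wrapping around. */
static uint32_t timeout_saturate(uint64_t timeout)
{
    return timeout > UINT32_MAX ? UINT32_MAX : (uint32_t)timeout;
}
```

With truncation, a request just above the 32-bit range turns into a very short timeout; with saturation it becomes the longest one the timer can represent. Both helpers agree for any value that already fits in 32 bits.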
> > -- Hal > > > if (cl_status != CL_SUCCESS) > > { > > From eitan at mellanox.co.il Sun Feb 12 05:25:20 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 12 Feb 2006 15:25:20 +0200 Subject: [openib-general] IPoIB and lid change Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B737@mtlexch01.mtl.com> Hi The issue with IPoIB address change is not just LID change but also QP change. (IPoIB define the MAC to be QP,GID) . Anytime you do ifconfig down/up you might get a new QP and thus you need to refresh the ARP... I second Mike K. and propose we use gratuitous ARP reply whenever an IPoIB interface is brought up. > I wonder if this is why when I reload the IB drivers on one node > I sometimes have to reload them on other nodes too. Otherwise > ping over IPoIB doesn't work. > > > The remote LID may get changed for other reasons too without an SM > > change (SM merge of 2 separate subnets). How can this be handled ? > > Isn't this just another case of the SM changing for one of the subnets? > > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sun Feb 12 05:21:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 08:21:56 -0500 Subject: [openib-general] IPoIB and lid change In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B737@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B737@mtlexch01.mtl.com> Message-ID: <1139750516.4475.6536.camel@hal.voltaire.com> On Sun, 2006-02-12 at 08:25, Eitan Zahavi wrote: > Hi > > The issue with IPoIB address change is not just LID change but also QP > change. > (IPoIB define the MAC to be QP,GID) . > > Anytime you do ifconfig down/up you might get a new QP and thus you need > to refresh the ARP... > > I second Mike K. 
and propose we use gratuitous ARP reply whenever an > IPoIB interface is brought up. That helps, but I am not sure this is a total solution, as I think there are nodes which do not cache a gratuitous response unless they requested it (as it is not a requirement to do so). -- Hal > > I wonder if this is why when I reload the IB drivers on one node > > I sometimes have to reload them on other nodes too. Otherwise > > ping over IPoIB doesn't work. > > > > > The remote LID may get changed for other reasons too without an SM > > > change (SM merge of 2 separate subnets). How can this be handled ? > > > > Isn't this just another case of the SM changing for one of the > subnets? > > > > grant > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Sun Feb 12 05:33:28 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 12 Feb 2006 15:33:28 +0200 Subject: [openib-general] IPoIB and lid change In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B737@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B737@mtlexch01.mtl.com> Message-ID: <43EF3928.1090402@mellanox.co.il> Eitan Zahavi wrote: > Hi > > The issue with IPoIB address change is not just LID change but also QP > change. > (IPoIB define the MAC to be QP,GID) . > > Anytime you do ifconfig down/up you might get a new QP and thus you need > to refresh the ARP... > > I second Mike K. and propose we use gratuitous ARP reply whenever an > IPoIB interface is brought up. > > Note that currently ifconfig down/up keeps the QP number, since it only resets the QP and does not destroy it. A QP number will be changed only if the IPoIB module is downloaded. Tziporet From mst at mellanox.co.il Sun Feb 12 05:36:49 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 12 Feb 2006 15:36:49 +0200 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <1139587533.4450.6094.camel@hal.voltaire.com> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> Message-ID: <20060212133649.GA12737@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: IPoIB and lid change > > On Wed, 2006-02-08 at 15:14, Michael S. Tsirkin wrote: > > Hi, Roland! > > One issue we have with IPoIB is that IPoIB may cache a remote node path for > > a long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB > > might lose connectivity. > > The remote LID may get changed for other reasons too without an SM > change (SM merge of 2 separate subnets). How can this be handled ? Change the SM to trigger SM lid change for both subnets? > > One simple way to address this would be to have a list of all > > address handles per net device and kill them on an SM change event. > > > > What do you think? > -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies
From halr at voltaire.com Sun Feb 12 05:37:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 08:37:43 -0500 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <20060212133649.GA12737@mellanox.co.il> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060212133649.GA12737@mellanox.co.il> Message-ID: <1139751065.4475.6568.camel@hal.voltaire.com> On Sun, 2006-02-12 at 08:36, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: IPoIB and lid change > > > > On Wed, 2006-02-08 at 15:14, Michael S. Tsirkin wrote: > > > Hi, Roland! > > > One issue we have with IPoIB is that IPoIB may cache a remote node path for > > > a long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB > > > might lose connectivity. > > > > The remote LID may get changed for other reasons too without an SM > > change (SM merge of 2 separate subnets). How can this be handled ? > > Change the SM to trigger SM lid change for both subnets? I think that's overly disruptive for the subnet whose SM didn't change. There should be a gentler way... -- Hal > > > One simple way to address this would be to have a list of all > > > address handles per net device and kill them on an SM change event. > > > > > > What do you think? > > From mst at mellanox.co.il Sun Feb 12 05:52:07 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 12 Feb 2006 15:52:07 +0200 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <1139751065.4475.6568.camel@hal.voltaire.com> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060212133649.GA12737@mellanox.co.il> <1139751065.4475.6568.camel@hal.voltaire.com> Message-ID: <20060212135207.GB12737@mellanox.co.il> Quoting Hal Rosenstock : > > > > Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a > > > > remote node path for a long time. Remote LID may get changed e.g. if the > > > > SM is changed, and IPoIB might lose connectivity. > > > > > > The remote LID may get changed for other reasons too without an SM > > > change (SM merge of 2 separate subnets). How can this be handled ? > > > > Change the SM to trigger SM lid change for both subnets? > > I think that's overly disruptive for the subnet whose SM didn't change. > There should be a gentler way... So that another issue with IPoIB subnet spanning multiple IB subnets. The IPoIB spec already lists several of these. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Sun Feb 12 05:49:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 08:49:59 -0500 Subject: [openib-general] Re: [PATCH] Opensm - add includes for PRIx64 In-Reply-To: <5z3biolsyd.fsf@mtl066.yok.mtl.com> References: <5z3biolsyd.fsf@mtl066.yok.mtl.com> Message-ID: <1139751562.4475.6586.camel@hal.voltaire.com> On Sun, 2006-02-12 at 06:25, Yael Kalka wrote: > Hi Hal, > > In gen2 stack, the PRIx64 definitions come either from inttypes.h or > from cl_debug_osd.h file (included by cl_debug.h). In windows stack > inttypes.h file doesn't exist, thus the PRIx64 definition comes only > from the cl_debug_osd.h file. The following patch adds includes to > cl_debug.h in files where this include is missing, and there is use of > PRIx64. Thanks. Applied. 
-- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka From halr at voltaire.com Sun Feb 12 05:53:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 08:53:13 -0500 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <20060212135207.GB12737@mellanox.co.il> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060212133649.GA12737@mellanox.co.il> <1139751065.4475.6568.camel@hal.voltaire.com> <20060212135207.GB12737@mellanox.co.il> Message-ID: <1139752048.4475.6605.camel@hal.voltaire.com> On Sun, 2006-02-12 at 08:52, Michael S. Tsirkin wrote: > Quoting Hal Rosenstock : > > > > > Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a > > > > > remote node path for a long time. Remote LID may get changed e.g. if the > > > > > SM is changed, and IPoIB might lose connectivity. > > > > > > > > The remote LID may get changed for other reasons too without an SM > > > > change (SM merge of 2 separate subnets). How can this be handled ? > > > > > > Change the SM to trigger SM lid change for both subnets? > > > > I think that's overly disruptive for the subnet whose SM didn't change. > > There should be a gentler way... > > So that's another issue with an IPoIB subnet spanning multiple IB subnets. > The IPoIB spec already lists several of these. I was referring to the subnet merge case. In the case of 2 IB subnets within a single IPoIB subnet, there is another layer of this issue as you point out. -- Hal From mst at mellanox.co.il Sun Feb 12 06:09:31 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 12 Feb 2006 16:09:31 +0200 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <1139752048.4475.6605.camel@hal.voltaire.com> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060212133649.GA12737@mellanox.co.il> <1139751065.4475.6568.camel@hal.voltaire.com> <20060212135207.GB12737@mellanox.co.il> <1139752048.4475.6605.camel@hal.voltaire.com> Message-ID: <20060212140931.GE11812@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: IPoIB and lid change > > On Sun, 2006-02-12 at 08:52, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > > > > > Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a > > > > > > remote node path for a long time. Remote LID may get changed e.g. if the > > > > > > SM is changed, and IPoIB might lose connectivity. > > > > > > > > > > The remote LID may get changed for other reasons too without an SM > > > > > change (SM merge of 2 separate subnets). How can this be handled ? > > > > > > > > Change the SM to trigger SM lid change for both subnets? > > > > > > I think that's overly disruptive for the subnet whose SM didn't change. > > > There should be a gentler way... > > > > So that's another issue with an IPoIB subnet spanning multiple IB subnets. > > The IPoIB spec already lists several of these. > > I was referring to the subnet merge case. In the case of 2 IB subnets > within a single IPoIB subnet, there is another layer of this issue as > you point out. But if one of the subnets being merged was not part of the IPoIB subnet, we don't have IPoIB caching the IB path. So IB path changes don't affect IPoIB. What am I missing? -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Sun Feb 12 06:14:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 09:14:32 -0500 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: <20060212140931.GE11812@mellanox.co.il> References: <20060208201404.GE32759@mellanox.co.il> <1139587533.4450.6094.camel@hal.voltaire.com> <20060212133649.GA12737@mellanox.co.il> <1139751065.4475.6568.camel@hal.voltaire.com> <20060212135207.GB12737@mellanox.co.il> <1139752048.4475.6605.camel@hal.voltaire.com> <20060212140931.GE11812@mellanox.co.il> Message-ID: <1139753671.4475.15.camel@hal.voltaire.com> On Sun, 2006-02-12 at 09:09, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: IPoIB and lid change > > > > On Sun, 2006-02-12 at 08:52, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > > > > > Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a > > > > > > > remote node path for a long time. Remote LID may get changed e.g. if the > > > > > > > SM is changed, and IPoIB might lose connectivity. > > > > > > > > > > > > The remote LID may get changed for other reasons too without an SM > > > > > > change (SM merge of 2 separate subnets). How can this be handled ? > > > > > > > > > > Change the SM to trigger SM lid change for both subnets? > > > > > > > > I think that's overly disruptive for the subnet whose SM didn't change. > > > > There should be a gentler way... > > > > > > So that's another issue with an IPoIB subnet spanning multiple IB subnets. > > > The IPoIB spec already lists several of these. > > > > I was referring to the subnet merge case. In the case of 2 IB subnets > > within a single IPoIB subnet, there is another layer of this issue as > > you point out. > > But if one of the subnets being merged was not part of the IPoIB > subnet, we don't have IPoIB caching the IB path. > So IB path changes don't affect IPoIB. What am I missing?
A subnet merge in the remote subnet can affect the existing LIDs of the remote subnet to which there were paths depending on the SM policy. That's one case related to the case in the local side. Of course, we're talking theory here because there are no IB routers, right ? -- Hal From halr at voltaire.com Sun Feb 12 06:19:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Feb 2006 09:19:45 -0500 Subject: [openib-general] Re: [PATCH] Opensm - opensm/osm_ucast_updn.c In-Reply-To: <5z1wy8lsou.fsf@mtl066.yok.mtl.com> References: <5z1wy8lsou.fsf@mtl066.yok.mtl.com> Message-ID: <1139753985.4475.28.camel@hal.voltaire.com> On Sun, 2006-02-12 at 06:31, Yael Kalka wrote: > Hi Hal, > > The following patch cleans up the osm_ucast_updn.c construct function > that you pointed out, and also adds a check if the construct succeeded > during updn_init. > > Thanks, > Yael Thanks. Applied. -- Hal > Signed-off-by: Yael Kalka From jackm at mellanox.co.il Sun Feb 12 07:27:44 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 12 Feb 2006 17:27:44 +0200 Subject: [openib-general] [PATCH 1 of 3] mad: large RMPP support, Round 2 Message-ID: <20060212152744.GA19049@mellanox.co.il> Implement large RMPP support: changes to header files. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: src/drivers/infiniband/include/rdma/ib_mad.h =================================================================== --- src.orig/drivers/infiniband/include/rdma/ib_mad.h 2006-02-12 16:08:25.503620000 +0200 +++ src/drivers/infiniband/include/rdma/ib_mad.h 2006-02-12 16:08:48.063975000 +0200 @@ -141,6 +141,13 @@ struct ib_rmpp_hdr { __be32 paylen_newwin; }; +struct ib_mad_multipacket_seg { + struct list_head list; + u32 num; + u16 size; + u8 data[0]; +}; + typedef u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) @@ -485,17 +492,6 @@ int ib_unregister_mad_agent(struct ib_ma int ib_post_send_mad(struct ib_mad_send_buf *send_buf, struct ib_mad_send_buf **bad_send_buf); -/** - * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. - * @mad_recv_wc: Work completion information for a received MAD. - * @buf: User-provided data buffer to receive the coalesced buffers. The - * referenced buffer should be at least the size of the mad_len specified - * by @mad_recv_wc. - * - * This call copies a chain of received MAD segments into a single data buffer, - * removing duplicated headers. - */ -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf); /** * ib_free_recv_mad - Returns data buffers used to receive a MAD. @@ -601,6 +597,16 @@ struct ib_mad_send_buf * ib_create_send_ gfp_t gfp_mask); /** + * *ib_mad_get_multipacket_seg - returns a given RMPP segment. + * @send_buf: Previously allocated send data buffer. + * @seg_num: number of segment to return + * + * This routine returns a pointer to a segment of a multipacket RMPP message. + */ +struct ib_mad_multipacket_seg +*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num); + +/** * ib_free_send_mad - Returns data buffers used to send a MAD. * @send_buf: Previously allocated send data buffer. 
*/ Index: src/drivers/infiniband/core/mad_priv.h =================================================================== --- src.orig/drivers/infiniband/core/mad_priv.h 2006-02-12 16:08:25.304631000 +0200 +++ src/drivers/infiniband/core/mad_priv.h 2006-02-12 16:08:48.073973000 +0200 @@ -119,10 +119,12 @@ struct ib_mad_send_wr_private { struct list_head agent_list; struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_buf send_buf; - DECLARE_PCI_UNMAP_ADDR(mapping) + DECLARE_PCI_UNMAP_ADDR(header_mapping) + DECLARE_PCI_UNMAP_ADDR(payload_mapping) struct ib_send_wr send_wr; struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; __be64 tid; + void *mad_payload; /* RMPP: changed per segment */ unsigned long timeout; int retries; int retry; @@ -130,9 +132,13 @@ struct ib_mad_send_wr_private { enum ib_wc_status status; /* RMPP control */ + struct list_head multipacket_list; + struct ib_mad_multipacket_seg *last_ack_seg; + struct ib_mad_multipacket_seg *seg_num_seg; int last_ack; int seg_num; int newwin; + int total_length; int total_seg; int data_offset; int pad; @@ -218,4 +224,7 @@ void ib_mark_mad_done(struct ib_mad_send void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr, int timeout_ms); +struct ib_mad_multipacket_seg +*ib_rmpp_get_multipacket_seg(struct ib_mad_send_wr_private *wr, int seg_num); + #endif /* __IB_MAD_PRIV_H__ */ From jackm at mellanox.co.il Sun Feb 12 07:28:59 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 12 Feb 2006 17:28:59 +0200 Subject: [openib-general] [PATCH 2 of 3] mad: large RMPP support, Round 2 Message-ID: <20060212152859.GB19049@mellanox.co.il> Implement large RMPP support: Receive side: copy the arriving MADs to chunks instead of coalescing to one large buffer in kernel space. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: src/drivers/infiniband/core/mad_rmpp.c =================================================================== --- src.orig/drivers/infiniband/core/mad_rmpp.c 2006-02-12 16:30:34.211437000 +0200 +++ src/drivers/infiniband/core/mad_rmpp.c 2006-02-12 16:30:44.624175000 +0200 @@ -433,44 +433,6 @@ static struct ib_mad_recv_wc * complete_ return rmpp_wc; } -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf) -{ - struct ib_mad_recv_buf *seg_buf; - struct ib_rmpp_mad *rmpp_mad; - void *data; - int size, len, offset; - u8 flags; - - len = mad_recv_wc->mad_len; - if (len <= sizeof(struct ib_mad)) { - memcpy(buf, mad_recv_wc->recv_buf.mad, len); - return; - } - - offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); - - list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { - rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; - flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); - - if (flags & IB_MGMT_RMPP_FLAG_FIRST) { - data = rmpp_mad; - size = sizeof(*rmpp_mad); - } else { - data = (void *) rmpp_mad + offset; - if (flags & IB_MGMT_RMPP_FLAG_LAST) - size = len; - else - size = sizeof(*rmpp_mad) - offset; - } - - memcpy(buf, data, size); - len -= size; - buf += size; - } -} -EXPORT_SYMBOL(ib_coalesce_recv_mad); - static struct ib_mad_recv_wc * continue_rmpp(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) @@ -570,13 +532,6 @@ start_rmpp(struct ib_mad_agent_private * return mad_recv_wc; } -static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) -{ - return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + - (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * - (mad_send_wr->seg_num - 1); -} - static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; Index: src/drivers/infiniband/core/user_mad.c =================================================================== --- src.orig/drivers/infiniband/core/user_mad.c 2006-02-12 
16:30:34.293433000 +0200 +++ src/drivers/infiniband/core/user_mad.c 2006-02-12 16:30:44.636158000 +0200 @@ -123,6 +123,7 @@ struct ib_umad_packet { struct ib_mad_send_buf *msg; struct list_head list; int length; + struct list_head seg_list; struct ib_user_mad mad; }; @@ -176,6 +177,73 @@ static int queue_packet(struct ib_umad_f return ret; } +static int data_offset(u8 mgmt_class) +{ + if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) + return IB_MGMT_SA_HDR; + else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && + (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return IB_MGMT_VENDOR_HDR; + else + return IB_MGMT_RMPP_HDR; +} + +static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + struct ib_umad_packet *packet) +{ + struct ib_mad_recv_buf *seg_buf; + struct ib_rmpp_mad *rmpp_mad; + void *data; + struct ib_mad_multipacket_seg *seg; + int size, len, offset; + u8 flags; + + len = mad_recv_wc->mad_len; + if (len <= sizeof(struct ib_mad)) { + memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); + return 0; + } + + /* Multipacket (RMPP) MAD */ + offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); + + list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { + rmpp_mad = (struct ib_rmpp_mad *) seg_buf->mad; + flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); + + if (flags & IB_MGMT_RMPP_FLAG_FIRST) { + size = sizeof(*rmpp_mad); + memcpy(&packet->mad.data, rmpp_mad, size); + } else { + data = (void *) rmpp_mad + offset; + if (flags & IB_MGMT_RMPP_FLAG_LAST) + size = len; + else + size = sizeof(*rmpp_mad) - offset; + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + sizeof(struct ib_rmpp_mad) - offset, + GFP_KERNEL); + if (!seg) + return -ENOMEM; + memcpy(seg->data, data, size); + list_add_tail(&seg->list, &packet->seg_list); + } + len -= size; + } + return 0; +} + +static void free_packet(struct ib_umad_packet *packet) +{ + struct ib_mad_multipacket_seg *seg, *tmp; + + list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { 
+ list_del(&seg->list); + kfree(seg); + } + kfree(packet); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { @@ -204,34 +272,6 @@ out: kfree(packet); } -static struct ib_umad_packet *alloc_packet(int buf_size) -{ - struct ib_umad_packet *packet; - int length = sizeof *packet + buf_size; - - if (length >= PAGE_SIZE) - packet = (void *)__get_free_pages(GFP_KERNEL, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - packet = kmalloc(length, GFP_KERNEL); - - if (!packet) - return NULL; - - memset(packet, 0, length); - return packet; -} - -static void free_packet(struct ib_umad_packet *packet) -{ - int length = packet->length + sizeof *packet; - if (length >= PAGE_SIZE) - free_pages((unsigned long) packet, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(packet); -} - - - static void recv_handler(struct ib_mad_agent *agent, struct ib_mad_recv_wc *mad_recv_wc) { @@ -243,13 +283,16 @@ static void recv_handler(struct ib_mad_a goto out; length = mad_recv_wc->mad_len; - packet = alloc_packet(length); + packet = kzalloc(sizeof *packet + sizeof(struct ib_mad), GFP_KERNEL); if (!packet) goto out; - + INIT_LIST_HEAD(&packet->seg_list); packet->length = length; - ib_coalesce_recv_mad(mad_recv_wc, packet->mad.data); + if (copy_recv_mad(mad_recv_wc, packet)) { + free_packet(packet); + goto out; + } packet->mad.hdr.status = 0; packet->mad.hdr.length = length + sizeof (struct ib_user_mad); @@ -278,6 +321,7 @@ static ssize_t ib_umad_read(struct file size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; + struct ib_mad_multipacket_seg *seg; struct ib_umad_packet *packet; ssize_t ret; @@ -304,18 +348,44 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); - if (count < packet->length + sizeof (struct ib_user_mad)) { - /* Return length needed (and first RMPP segment) if too small */ - if (copy_to_user(buf, &packet->mad, - sizeof (struct ib_user_mad) + sizeof (struct 
ib_mad))) - ret = -EFAULT; - else - ret = -ENOSPC; - } else if (copy_to_user(buf, &packet->mad, - packet->length + sizeof (struct ib_user_mad))) + if (copy_to_user(buf, &packet->mad, + sizeof(struct ib_user_mad) + sizeof(struct ib_mad))) { ret = -EFAULT; - else + goto err; + } + + if (count < packet->length + sizeof (struct ib_user_mad)) + /* + * User buffer too small. Return first RMPP segment (which + * includes RMPP message length). + */ + ret = -ENOSPC; + else if (packet->length <= sizeof(struct ib_mad)) + ret = packet->length + sizeof(struct ib_user_mad); + else { + int len = packet->length - sizeof(struct ib_mad); + struct ib_rmpp_mad *rmpp_mad = + (struct ib_rmpp_mad *) packet->mad.data; + int max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + int seg_payload; + /* + * Multipacket RMPP MAD message. Copy remainder of message. + * Note that last segment may have a shorter payload. + */ + buf += sizeof(struct ib_user_mad) + sizeof(struct ib_mad); + list_for_each_entry(seg, &packet->seg_list, list) { + seg_payload = min_t(int, len, max_seg_payload); + if (copy_to_user(buf, seg->data, seg_payload)) { + ret = -EFAULT; + goto err; + } + buf += seg_payload; + len -= seg_payload; + } ret = packet->length + sizeof (struct ib_user_mad); + } +err: if (ret < 0) { /* Requeue packet */ spin_lock_irq(&file->recv_lock); From jackm at mellanox.co.il Sun Feb 12 07:30:36 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 12 Feb 2006 17:30:36 +0200 Subject: [openib-general] [PATCH 3 of 3] mad: large RMPP support, Round 2 Message-ID: <20060212153036.GC19049@mellanox.co.il> Implement large RMPP support: Send side: split a multipacket MAD buffer to a list of segments, (multipacket_list) and send these using a gather list of size 2. Also, save pointer to last sent segment, and retrieve requested segments by walking list starting at last sent segment. Finally, save pointer to last-acked segment. 
When retrying, retrieve segments for resending relative to this pointer. When updating last ack, start at this pointer. The list scan to get the next segment is thus reduced from O(N^2) to O(N). In normal flow, the segment list will be scanned only twice (once for retrieving next segment to send, once for updating the last-ack pointer). Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: src/drivers/infiniband/core/mad_rmpp.c =================================================================== --- src.orig/drivers/infiniband/core/mad_rmpp.c 2006-02-12 16:30:44.624175000 +0200 +++ src/drivers/infiniband/core/mad_rmpp.c 2006-02-12 16:30:53.114901000 +0200 @@ -535,6 +535,7 @@ start_rmpp(struct ib_mad_agent_private * static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; + struct ib_mad_multipacket_seg *seg; int timeout; u32 paylen; @@ -547,14 +548,16 @@ static int send_next_seg(struct ib_mad_s paylen = mad_send_wr->total_seg * IB_MGMT_RMPP_DATA - mad_send_wr->pad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(paylen); - mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); } else { - mad_send_wr->send_wr.num_sge = 2; - mad_send_wr->sg_list[0].length = mad_send_wr->data_offset; - mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr); - mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - - mad_send_wr->data_offset; - mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + seg = ib_rmpp_get_multipacket_seg(mad_send_wr, + mad_send_wr->seg_num); + if (!seg) { + printk(KERN_ERR PFX "send_next_seg: " + "could not find segment %d\n", + mad_send_wr->seg_num); + return -EINVAL; + } + mad_send_wr->mad_payload = seg->data; rmpp_mad->rmpp_hdr.paylen_newwin = 0; } @@ -600,6 +603,28 @@ out: spin_unlock_irqrestore(&agent->lock, flags); } +static inline void adjust_last_ack(struct ib_mad_send_wr_private *wr) +{ + struct ib_mad_multipacket_seg *seg; + + if (wr->last_ack < 2) + return; + else if
(!wr->last_ack_seg) + list_for_each_entry(seg, &wr->multipacket_list, list) { + if (wr->last_ack == seg->num) { + wr->last_ack_seg = seg; + break; + } + } + else + list_for_each_entry(seg, &wr->last_ack_seg->list, list) { + if (wr->last_ack == seg->num) { + wr->last_ack_seg = seg; + break; + } + } +} + static void process_rmpp_ack(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) { @@ -647,6 +672,7 @@ static void process_rmpp_ack(struct ib_m if (seg_num > mad_send_wr->last_ack) { mad_send_wr->last_ack = seg_num; + adjust_last_ack(mad_send_wr); mad_send_wr->retries = mad_send_wr->send_buf.retries; } mad_send_wr->newwin = newwin; @@ -793,7 +819,7 @@ out: int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; - int i, total_len, ret; + int ret; rmpp_mad = mad_send_wr->send_buf.mad; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & @@ -803,20 +829,16 @@ int ib_send_rmpp_mad(struct ib_mad_send_ if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) return IB_RMPP_RESULT_INTERNAL; - if (mad_send_wr->send_wr.num_sge > 1) - return -EINVAL; /* TODO: support num_sge > 1 */ + if (mad_send_wr->send_wr.num_sge != 2) + return -EINVAL; mad_send_wr->seg_num = 1; mad_send_wr->newwin = 1; mad_send_wr->data_offset = data_offset(rmpp_mad->mad_hdr.mgmt_class); - total_len = 0; - for (i = 0; i < mad_send_wr->send_wr.num_sge; i++) - total_len += mad_send_wr->send_wr.sg_list[i].length; - - mad_send_wr->total_seg = (total_len - mad_send_wr->data_offset) / + mad_send_wr->total_seg = (mad_send_wr->total_length - mad_send_wr->data_offset) / (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); - mad_send_wr->pad = total_len - IB_MGMT_RMPP_HDR - + mad_send_wr->pad = mad_send_wr->total_length - IB_MGMT_RMPP_HDR - be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); /* We need to wait for the final ACK even if there isn't a response */ @@ -880,6 +902,8 @@ int ib_retry_rmpp(struct ib_mad_send_wr_ return IB_RMPP_RESULT_PROCESSED; 
mad_send_wr->seg_num = mad_send_wr->last_ack + 1; + mad_send_wr->seg_num_seg = mad_send_wr->last_ack_seg; + ret = send_next_seg(mad_send_wr); if (ret) return IB_RMPP_RESULT_PROCESSED; Index: src/drivers/infiniband/core/mad.c =================================================================== --- src.orig/drivers/infiniband/core/mad.c 2006-02-12 16:30:29.940545000 +0200 +++ src/drivers/infiniband/core/mad.c 2006-02-12 16:30:53.131904000 +0200 @@ -779,6 +779,54 @@ static int get_buf_length(int hdr_len, i return hdr_len + data_len + pad; } +static void free_send_multipacket_list(struct ib_mad_send_wr_private * + mad_send_wr) +{ + struct ib_mad_multipacket_seg *s, *t; + + list_for_each_entry_safe(s, t, &mad_send_wr->multipacket_list, list) { + list_del(&s->list); + kfree(s); + } +} + +static inline int alloc_send_rmpp_segs(struct ib_mad_send_wr_private *send_wr, + int message_size, int hdr_len, + int data_len, u8 rmpp_version, + gfp_t gfp_mask) +{ + struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_mad *rmpp_mad = send_wr->send_buf.mad; + int seg_size, i = 2; + + rmpp_mad->rmpp_hdr.paylen_newwin = + cpu_to_be32(hdr_len - IB_MGMT_RMPP_HDR + data_len); + rmpp_mad->rmpp_hdr.rmpp_version = rmpp_version; + rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; + ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); + send_wr->total_length = message_size; + /* allocate RMPP buffers */ + message_size -= sizeof(struct ib_mad); + seg_size = sizeof(struct ib_mad) - hdr_len; + while (message_size > 0) { + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + seg_size, + gfp_mask); + if (!seg) { + printk(KERN_ERR "ib_create_send_mad: RMPP mem " + "alloc failed for len %zd, gfp %#x\n", + sizeof(struct ib_mad_multipacket_seg) + seg_size, + gfp_mask); + free_send_multipacket_list(send_wr); + return -ENOMEM; + } + seg->size = seg_size; + seg->num = i++; + list_add_tail(&seg->list, &send_wr->multipacket_list); + message_size -= seg_size; + } + return 0; +} + struct 
ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, int rmpp_active, @@ -787,53 +835,54 @@ struct ib_mad_send_buf * ib_create_send_ { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; - int length, buf_size; + int length, message_size, ret; void *buf; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - buf_size = get_buf_length(hdr_len, data_len); + message_size = get_buf_length(hdr_len, data_len); if ((!mad_agent->rmpp_version && - (rmpp_active || buf_size > sizeof(struct ib_mad))) || - (!rmpp_active && buf_size > sizeof(struct ib_mad))) + (rmpp_active || message_size > sizeof(struct ib_mad))) || + (!rmpp_active && message_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - length = sizeof *mad_send_wr + buf_size; - if (length >= PAGE_SIZE) - buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - buf = kmalloc(length, gfp_mask); + length = sizeof *mad_send_wr + message_size; + buf = kzalloc(sizeof *mad_send_wr + sizeof(struct ib_mad), gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, length); - - mad_send_wr = buf + buf_size; + mad_send_wr = buf + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_send_wr->multipacket_list); mad_send_wr->send_buf.mad = buf; + mad_send_wr->mad_payload = buf + hdr_len; mad_send_wr->mad_agent_priv = mad_agent_priv; - mad_send_wr->sg_list[0].length = buf_size; + mad_send_wr->sg_list[0].length = hdr_len; mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey; + mad_send_wr->sg_list[1].length = sizeof(struct ib_mad) - hdr_len; + mad_send_wr->sg_list[1].lkey = mad_agent->mr->lkey; mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr; mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; - mad_send_wr->send_wr.num_sge = 1; + mad_send_wr->send_wr.num_sge = 2; mad_send_wr->send_wr.opcode = IB_WR_SEND; mad_send_wr->send_wr.send_flags = IB_SEND_SIGNALED; 
mad_send_wr->send_wr.wr.ud.remote_qpn = remote_qpn; mad_send_wr->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; + mad_send_wr->last_ack_seg = NULL; + mad_send_wr->seg_num_seg = NULL; if (rmpp_active) { - struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad; - rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - - IB_MGMT_RMPP_HDR + data_len); - rmpp_mad->rmpp_hdr.rmpp_version = mad_agent->rmpp_version; - rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; - ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, - IB_MGMT_RMPP_FLAG_ACTIVE); + ret = alloc_send_rmpp_segs(mad_send_wr, message_size, hdr_len, + data_len, mad_agent->rmpp_version, + gfp_mask); + if (ret) { + kfree(buf); + return ERR_PTR(ret); + } } mad_send_wr->send_buf.mad_agent = mad_agent; @@ -842,23 +891,71 @@ struct ib_mad_send_buf * ib_create_send_ } EXPORT_SYMBOL(ib_create_send_mad); +struct ib_mad_multipacket_seg +*ib_rmpp_get_multipacket_seg(struct ib_mad_send_wr_private *wr, int seg_num) +{ + struct ib_mad_multipacket_seg *seg; + + if (seg_num == 2) { + wr->seg_num_seg = + container_of(wr->multipacket_list.next, + struct ib_mad_multipacket_seg, list); + return wr->seg_num_seg; + } + + /* get first list entry if was not already done */ + if (!wr->seg_num_seg) + wr->seg_num_seg = + container_of(wr->multipacket_list.next, + struct ib_mad_multipacket_seg, list); + + if (wr->seg_num_seg->num == seg_num) + return wr->seg_num_seg; + else if (wr->seg_num_seg->num < seg_num) { + list_for_each_entry(seg, &wr->seg_num_seg->list, list) { + if (seg->num == seg_num) { + wr->seg_num_seg = seg; + return wr->seg_num_seg; + } + } + return NULL; + } else { + list_for_each_entry_reverse(seg, &wr->seg_num_seg->list, list) { + if (seg->num == seg_num) { + wr->seg_num_seg = seg; + return wr->seg_num_seg; + } + } + return NULL; + } + return NULL; +} + +struct ib_mad_multipacket_seg +*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num) +{ + struct 
ib_mad_send_wr_private *wr; + + if (seg_num < 2) + return NULL; + + wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); + return ib_rmpp_get_multipacket_seg(wr, seg_num); +} +EXPORT_SYMBOL(ib_mad_get_multipacket_seg); + void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { struct ib_mad_agent_private *mad_agent_priv; - void *mad_send_wr; - int length; + struct ib_mad_send_wr_private *mad_send_wr; mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - length = sizeof(struct ib_mad_send_wr_private) + (mad_send_wr - send_buf->mad); - if (length >= PAGE_SIZE) - free_pages((unsigned long)send_buf->mad, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(send_buf->mad); - + free_send_multipacket_list(mad_send_wr); + kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } @@ -881,10 +978,17 @@ int ib_send_mad(struct ib_mad_send_wr_pr mad_agent = mad_send_wr->send_buf.mad_agent; sge = mad_send_wr->sg_list; - sge->addr = dma_map_single(mad_agent->device->dma_device, - mad_send_wr->send_buf.mad, sge->length, - DMA_TO_DEVICE); - pci_unmap_addr_set(mad_send_wr, mapping, sge->addr); + sge[0].addr = dma_map_single(mad_agent->device->dma_device, + mad_send_wr->send_buf.mad, + sge[0].length, + DMA_TO_DEVICE); + pci_unmap_addr_set(mad_send_wr, header_mapping, sge[0].addr); + + sge[1].addr = dma_map_single(mad_agent->device->dma_device, + mad_send_wr->mad_payload, + sge[1].length, + DMA_TO_DEVICE); + pci_unmap_addr_set(mad_send_wr, payload_mapping, sge[1].addr); spin_lock_irqsave(&qp_info->send_queue.lock, flags); if (qp_info->send_queue.count < qp_info->send_queue.max_active) { @@ -901,11 +1005,15 @@ int ib_send_mad(struct ib_mad_send_wr_pr list_add_tail(&mad_send_wr->mad_list.list, list); } spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); - if (ret) + if (ret) { 
dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, mapping), - sge->length, DMA_TO_DEVICE); + pci_unmap_addr(mad_send_wr, header_mapping), + sge[0].length, DMA_TO_DEVICE); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, payload_mapping), + sge[1].length, DMA_TO_DEVICE); + } return ret; } @@ -1876,8 +1984,11 @@ static void ib_mad_send_done_handler(str retry: dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, mapping), + pci_unmap_addr(mad_send_wr, header_mapping), mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); + dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, payload_mapping), + mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); queued_send_wr = NULL; spin_lock_irqsave(&send_queue->lock, flags); list_del(&mad_list->list); Index: src/drivers/infiniband/core/user_mad.c =================================================================== --- src.orig/drivers/infiniband/core/user_mad.c 2006-02-12 16:30:44.636158000 +0200 +++ src/drivers/infiniband/core/user_mad.c 2006-02-12 16:30:53.142901000 +0200 @@ -255,10 +255,11 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kzalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL); + timeout = kzalloc(sizeof *timeout + sizeof(struct ib_mad), + GFP_KERNEL); if (!timeout) goto out; - + INIT_LIST_HEAD(&timeout->seg_list); timeout->length = IB_MGMT_MAD_HDR; timeout->mad.hdr.id = packet->mad.hdr.id; timeout->mad.hdr.status = ETIMEDOUT; @@ -266,7 +267,7 @@ static void send_handler(struct ib_mad_a sizeof (struct ib_mad_hdr)); if (queue_packet(file, agent, timeout)) - kfree(timeout); + free_packet(timeout); } out: kfree(packet); @@ -409,6 +410,8 @@ static ssize_t ib_umad_write(struct file __be64 *tid; int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; + int s, seg_num; + struct 
ib_mad_multipacket_seg *seg; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -485,6 +488,11 @@ static ssize_t ib_umad_write(struct file goto err_ah; } + if (!rmpp_active && length > sizeof(struct ib_mad)) { + ret = -EINVAL; + goto err_ah; + } + packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), 0, rmpp_active, @@ -502,14 +510,32 @@ static ssize_t ib_umad_write(struct file /* Copy MAD headers (RMPP header in place) */ memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); - /* Now, copy rest of message from user into send buffer */ + /* complete copying first 256 bytes of message into send buffer */ if (copy_from_user(packet->msg->mad + copy_offset, buf + sizeof (struct ib_user_mad) + copy_offset, - length - copy_offset)) { + min_t(int, length, sizeof(struct ib_mad)) - copy_offset)) { ret = -EFAULT; goto err_msg; } + /* if RMPP, copy rest of send message from user to multipacket list */ + length -= sizeof(struct ib_mad); + if (length > 0) { + buf += sizeof (struct ib_user_mad) + sizeof(struct ib_mad); + for (seg_num = 2; length > 0; ++seg_num, buf += s, length -= s) { + seg = ib_mad_get_multipacket_seg(packet->msg, seg_num); + BUG_ON(!seg); + s = min_t(int, length, seg->size); + if (copy_from_user(seg->data, buf, s)) { + ret = -EFAULT; + goto err_msg; + } + } + /* Pad last segment with zeroes. */ + if (seg->size - s) + memset(seg->data + s, 0, seg->size - s); + } + /* * If userspace is generating a request that will generate a * response, we need to make sure the high-order part of the From eli at mellanox.co.il Sun Feb 12 07:41:43 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 12 Feb 2006 17:41:43 +0200 Subject: [openib-general] [PATCH] mthca: query_qp and query_srq Message-ID: <1139758904.5814.6.camel@mtls03.yok.mtl.com> Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin implement ib_query_qp() and ib_query_srq() for mthca. 
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -303,6 +303,8 @@ int mthca_RESIZE_CQ(struct mthca_dev *de u8 *status); int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); +int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num, + struct mthca_mailbox *mailbox, u8 *status); int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); Index: openib_gen2/drivers/infiniband/core/verbs.c =================================================================== --- openib_gen2.orig/drivers/infiniband/core/verbs.c +++ openib_gen2/drivers/infiniband/core/verbs.c @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) { - return qp->device->query_qp ? + int err; + + err = qp->device->query_qp ? 
qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : -ENOSYS; + if (err) + return err; + qp_init_attr->recv_cq = qp->recv_cq; + qp_init_attr->send_cq = qp->send_cq; + qp_init_attr->srq = qp->srq; + + return err; } EXPORT_SYMBOL(ib_query_qp); Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_srq.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c @@ -357,6 +357,40 @@ int mthca_modify_srq(struct ib_srq *ibsr return 0; } + +int mthca_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + struct mthca_mailbox *mailbox; + struct mthca_arbel_srq_context *arbel_ctx; + u8 status; + int err; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mthca_QUERY_SRQ(dev, srq->srqn, mailbox, &status); + if (err) + goto exit; + if (mthca_is_memfree(dev)) { + arbel_ctx = mailbox->buf; + srq_attr->srq_limit = arbel_ctx->limit_watermark; + } + else + srq_attr->srq_limit = 0; + + srq_attr->pd = ibsrq->pd; + srq_attr->max_wr = srq->max; + srq_attr->max_sge = srq->max_gs; + +exit: + mthca_free_mailbox(dev, mailbox); + + return err; +} + void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type) { Index: openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- openib_gen2.orig/drivers/infiniband/include/rdma/ib_verbs.h +++ openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h @@ -423,9 +423,10 @@ enum ib_srq_attr_mask { }; struct ib_srq_attr { - u32 max_wr; - u32 max_sge; - u32 srq_limit; + u32 max_wr; + u32 max_sge; + u32 srq_limit; + struct ib_pd *pd; }; struct ib_srq_init_attr { Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c 
=================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1270,7 +1270,10 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ); + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -1291,7 +1294,8 @@ int mthca_register_device(struct mthca_d if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; - dev->ib_dev.modify_srq = mthca_modify_srq; + dev->ib_dev.modify_srq = mthca_modify_srq; + dev->ib_dev.query_srq = mthca_query_srq; dev->ib_dev.destroy_srq = mthca_destroy_srq; if (mthca_is_memfree(dev)) @@ -1302,6 +1306,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.query_qp = mthca_query_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; dev->ib_dev.create_cq = mthca_create_cq; dev->ib_dev.resize_cq = mthca_resize_cq; Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_qp.c @@ -270,6 +270,32 @@ static int to_mthca_state(enum ib_qp_sta } } + +static inline enum ib_qp_state to_ib_qp_state(int mthca_state) +{ + switch (mthca_state) { + case MTHCA_QP_STATE_RST: return IB_QPS_RESET; + case MTHCA_QP_STATE_INIT: return IB_QPS_INIT; + case MTHCA_QP_STATE_RTR: return IB_QPS_RTR; + case MTHCA_QP_STATE_RTS: return IB_QPS_RTS; + case MTHCA_QP_STATE_DRAINING: + case MTHCA_QP_STATE_SQD: 
return IB_QPS_SQD; + case MTHCA_QP_STATE_SQE: return IB_QPS_SQE; + case MTHCA_QP_STATE_ERR: return IB_QPS_ERR; + default: return -1; + } +} + +static inline enum ib_mig_state to_ib_mig_state(int mthca_mig_state) +{ + switch (mthca_mig_state) { + case 0: return IB_MIG_ARMED; + case 1: return IB_MIG_REARM; + case 3: return IB_MIG_MIGRATED; + default: BUG(); + } +} + enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; static int to_mthca_st(int transport) @@ -553,6 +579,44 @@ static __be32 get_hw_access_flags(struct return cpu_to_be32(hw_access_flags); } +static int to_ib_qp_access_flags(int mthca_flags) +{ + int ib_flags = 0; + + if (mthca_flags & MTHCA_QP_BIT_RRE) + ib_flags |= IB_ACCESS_REMOTE_READ; + + if (mthca_flags & MTHCA_QP_BIT_RWE) + ib_flags |= IB_ACCESS_REMOTE_WRITE; + + if (mthca_flags & MTHCA_QP_BIT_RAE) + ib_flags |= IB_ACCESS_REMOTE_ATOMIC; + + return ib_flags; +} + +static void to_ib_ah_attr(struct mthca_dev *dev, struct ib_ah_attr *ib_ah_attr, + struct mthca_qp_path *path) +{ + memset(ib_ah_attr, 0, sizeof *path); + ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3; + ib_ah_attr->dlid = be16_to_cpu(path->rlid); + ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; + ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; + ib_ah_attr->static_rate = path->static_rate & 0x7; + ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? 
IB_AH_GRH : 0; + if (ib_ah_attr->ah_flags) { + ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1); + ib_ah_attr->grh.hop_limit = path->hop_limit; + ib_ah_attr->grh.traffic_class = + (be32_to_cpu(path->sl_tclass_flowlabel) >> 20) & 0xff; + ib_ah_attr->grh.flow_label = + be32_to_cpu(path->sl_tclass_flowlabel) & 0xfffff; + memcpy(ib_ah_attr->grh.dgid.raw, + path->rgid, sizeof ib_ah_attr->grh.dgid.raw); + } +} + static void mthca_path_set(struct ib_ah_attr *ah, struct mthca_qp_path *path) { path->g_mylmc = ah->src_path_bits & 0x7f; @@ -914,6 +978,79 @@ int mthca_modify_qp(struct ib_qp *ibqp, return err; } + + +int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + int err; + struct mthca_mailbox *mailbox; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *context; + int mthca_state; + u8 status; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mthca_QUERY_QP(dev, qp->qpn, 0, mailbox, &status); + if (err) + goto exit; + + qp_param = mailbox->buf; + context = &qp_param->context; + mthca_state = be32_to_cpu(context->flags) >> 28; + qp_attr->qp_state = to_ib_qp_state(mthca_state); + qp_attr->cur_qp_state = qp_attr->qp_state; + qp_attr->path_mtu = context->mtu_msgmax >> 5; + qp_attr->path_mig_state = + to_ib_mig_state((be32_to_cpu(context->flags) >> 11) & 0x3); + qp_attr->qkey = be32_to_cpu(context->qkey); + qp_attr->rq_psn = be32_to_cpu(context->rnr_nextrecvpsn) & 0xffffff; + qp_attr->sq_psn = be32_to_cpu(context->next_send_psn) & 0xffffff; + qp_attr->dest_qp_num = be32_to_cpu(context->remote_qpn) & 0xffffff; + qp_attr->qp_access_flags = + to_ib_qp_access_flags(be32_to_cpu(context->params2)); + qp_attr->cap.max_send_wr = qp->sq.max; + qp_attr->cap.max_recv_wr = qp->rq.max; + qp_attr->cap.max_send_sge = 
qp->sq.max_gs; + qp_attr->cap.max_recv_sge = qp->rq.max_gs; + qp_attr->cap.max_inline_data = qp->max_inline_data; + + to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); + to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + + qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; + qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + + + /* qp_attr->en_sqd_async_notify + this field is only applicable in modify qp */ + qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; + + qp_attr->max_rd_atomic = 1 << ((be32_to_cpu(context->params1) >> 21) & 0x7); + + qp_attr->max_dest_rd_atomic = + 1 << ((be32_to_cpu(context->params2) >> 21) & 0x7); + qp_attr->min_rnr_timer = + (be32_to_cpu(context->rnr_nextrecvpsn) >> 24) & 0x1f; + qp_attr->port_num = qp_attr->ah_attr.port_num; + qp_attr->timeout = context->pri_path.ackto >> 3; + qp_attr->retry_cnt = (be32_to_cpu(context->params1) >> 16) & 0x7; + qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; + qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; + qp_attr->alt_timeout = context->alt_path.ackto >> 3; + qp_init_attr->cap = qp_attr->cap; + +exit: + mthca_free_mailbox(dev, mailbox); + return err; +} + + static int mthca_max_data_size(struct mthca_dev *dev, struct mthca_qp *qp, int desc_sz) { /* Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1554,6 +1554,15 @@ int mthca_SW2HW_SRQ(struct mthca_dev *de CMD_TIME_CLASS_A, status); } + +int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num, + struct mthca_mailbox *mailbox, u8 *status) +{ + + return mthca_cmd_box(dev, 0, mailbox->dma, num, 0, + CMD_QUERY_SRQ, CMD_TIME_CLASS_A, status); +} + int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status) { Index: 
openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -486,6 +486,7 @@ int mthca_alloc_srq(struct mthca_dev *de void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, enum ib_srq_attr_mask attr_mask); +int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type); void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); @@ -497,6 +498,10 @@ int mthca_arbel_post_srq_recv(struct ib_ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); + +int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, From rdreier at cisco.com Sun Feb 12 08:32:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 12 Feb 2006 08:32:11 -0800 Subject: [openib-general] [PATCH] mthca: query_qp and query_srq In-Reply-To: <1139758904.5814.6.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Sun, 12 Feb 2006 17:41:43 +0200") References: <1139758904.5814.6.camel@mtls03.yok.mtl.com> Message-ID: > + ib_ah_attr->grh.sgid_index = path->mgid_index & > (dev->limits.gid_table_len - 1); Your patch is line-wrapped :( - R. 
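As a side note on the line quoted in the reply above: the patch reduces mgid_index by masking it with (dev->limits.gid_table_len - 1). The following is only a minimal sketch of that masking idiom, not code from the mthca driver; the helper name mask_index is hypothetical, and the sketch assumes the table length is a power of two, which is the only case where the mask behaves like a modulo.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of mthca): reduce a raw index into a
 * table whose length is assumed to be a power of two.
 */
static unsigned int mask_index(unsigned int raw_index, unsigned int table_len)
{
	/*
	 * "raw_index & (table_len - 1)" equals "raw_index % table_len"
	 * only when table_len is a non-zero power of two, so check that
	 * assumption explicitly instead of relying on it silently.
	 */
	assert(table_len != 0 && (table_len & (table_len - 1)) == 0);

	return raw_index & (table_len - 1);
}
```

With a 32-entry table, for instance, a raw value of 0x25 maps to entry 5. If the length were not a power of two, the mask could never produce some of the valid entries, which is why the sketch asserts on it.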
From rdreier at cisco.com Sun Feb 12 08:37:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 12 Feb 2006 08:37:48 -0800 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: <20060212075037.GA11550@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 12 Feb 2006 09:50:37 +0200") References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> <20060212075037.GA11550@mellanox.co.il> Message-ID: Michael> Basically, it's as Andrew said: the lock around clear_bit Michael> is there to ensure that ipoib_mcast_send isn't running Michael> already when we stop the thread. That's why test_bit has Michael> to be inside the lock, too. Makes sense I guess. If I'm understanding correctly, the lock isn't really there to serialize the bit ops, but rather to make sure ipoib_mcast_send() won't do anything after we clear the bit. Does that mean that there's no reason to take the lock around the set_bit()? - R. From dotanb at mellanox.co.il Sun Feb 12 08:43:24 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 12 Feb 2006 18:43:24 +0200 Subject: [openib-general] [PATCH 1/4] [RFC] Add ib_modify_qp_is_ok to core Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD509@mtlexch01.mtl.com> One comment: In SQD->SQD, changing the primary port should be allowed. Dotan From rdreier at cisco.com Sun Feb 12 08:47:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 12 Feb 2006 08:47:33 -0800 Subject: [openib-general] [PATCH 1/4] [RFC] Add ib_modify_qp_is_ok to core In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD509@mtlexch01.mtl.com> (Dotan Barak's message of "Sun, 12 Feb 2006 18:43:24 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD509@mtlexch01.mtl.com> Message-ID: Dotan> One comment: In SQD->SQD, changing the primary port should Dotan> be allowed. Thanks, I will fix that up. This seems to be allowed only for RC QPs according to the IB spec.
By the way, this problem was present in all three hw drivers in the tree, which shows how good combining the tables is for maintenance. - R. From mst at mellanox.co.il Sun Feb 12 08:56:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Feb 2006 18:56:57 +0200 Subject: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped In-Reply-To: References: <1139689341370-68b63fa9b8e76d91@cisco.com> <20060211140209.57af1b16.akpm@osdl.org> <20060212075037.GA11550@mellanox.co.il> Message-ID: <20060212165657.GA14127@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] Re: [git patch review 1/4] IPoIB: Don't start send-only joins while multicast thread is stopped > > Michael> Basically, it's as Andrew said: the lock around clear_bit > Michael> is there to ensure that ipoib_mcast_send isn't running > Michael> already when we stop the thread. That's why test_bit has > Michael> to be inside the lock, too. > > Makes sense I guess. If I'm understanding correctly, the lock isn't > really there to serialize the bit ops, but rather to make sure > ipoib_mcast_send() won't do anything after we clear the bit. Right. That's one way to put it. > Does that mean that there's no reason to take the lock around the set_bit()? Ugh, sorry, I don't really remember why I put it there. I guess I just have easier time reasoning about locks than barriers and atomic operations. "bit is protected by priv->lock" is a simple rule, and we are not on data path here. The fact that the race went unnoticed for a while validates this approach in my eyes. I guess longer term we will replace mcast_mutex with priv->lock anyway, so it doesn't matter much. -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From dotanb at mellanox.co.il Sun Feb 12 23:23:06 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 13 Feb 2006 09:23:06 +0200 Subject: [openib-general] [PATCH 1/4] [RFC] Add ib_modify_qp_is_ok to core Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD5AD@mtlexch01.mtl.com> > Thanks, I will fix that up. This seems to be allowed only for RC QPs > according to the IB spec. You are right, thanks. > > By the way, this problem was present in all three hw drivers in the > tree, which shows how good combining the tables is for maintenance. What about common code that check the validity of the attributes of the modify QP (for example pkey_idx)? Dotan From eli at mellanox.co.il Sun Feb 12 23:27:37 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 13 Feb 2006 09:27:37 +0200 Subject: [openib-general] [PATCH] mthca: query_qp and query_srq Message-ID: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin implement ib_query_qp() and ib_query_srq() for mthca.
Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -303,6 +303,8 @@ int mthca_RESIZE_CQ(struct mthca_dev *de u8 *status); int mthca_SW2HW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); +int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num, + struct mthca_mailbox *mailbox, u8 *status); int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status); int mthca_ARM_SRQ(struct mthca_dev *dev, int srq_num, int limit, u8 *status); Index: openib_gen2/drivers/infiniband/core/verbs.c =================================================================== --- openib_gen2.orig/drivers/infiniband/core/verbs.c +++ openib_gen2/drivers/infiniband/core/verbs.c @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) { - return qp->device->query_qp ? + int err; + + err = qp->device->query_qp ? 
qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : -ENOSYS; + if (err) + return err; + qp_init_attr->recv_cq = qp->recv_cq; + qp_init_attr->send_cq = qp->send_cq; + qp_init_attr->srq = qp->srq; + + return err; } EXPORT_SYMBOL(ib_query_qp); Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_srq.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_srq.c @@ -357,6 +357,40 @@ int mthca_modify_srq(struct ib_srq *ibsr return 0; } + +int mthca_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *srq_attr) +{ + struct mthca_dev *dev = to_mdev(ibsrq->device); + struct mthca_srq *srq = to_msrq(ibsrq); + struct mthca_mailbox *mailbox; + struct mthca_arbel_srq_context *arbel_ctx; + u8 status; + int err; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mthca_QUERY_SRQ(dev, srq->srqn, mailbox, &status); + if (err) + goto exit; + if (mthca_is_memfree(dev)) { + arbel_ctx = mailbox->buf; + srq_attr->srq_limit = arbel_ctx->limit_watermark; + } + else + srq_attr->srq_limit = 0; + + srq_attr->pd = ibsrq->pd; + srq_attr->max_wr = srq->max; + srq_attr->max_sge = srq->max_gs; + +exit: + mthca_free_mailbox(dev, mailbox); + + return err; +} + void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type) { Index: openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- openib_gen2.orig/drivers/infiniband/include/rdma/ib_verbs.h +++ openib_gen2/drivers/infiniband/include/rdma/ib_verbs.h @@ -423,9 +423,10 @@ enum ib_srq_attr_mask { }; struct ib_srq_attr { - u32 max_wr; - u32 max_sge; - u32 srq_limit; + u32 max_wr; + u32 max_sge; + u32 srq_limit; + struct ib_pd *pd; }; struct ib_srq_init_attr { Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c 
=================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1270,7 +1270,10 @@ int mthca_register_device(struct mthca_d (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | - (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ); + dev->ib_dev.node_type = RDMA_NODE_IB_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; @@ -1291,7 +1294,8 @@ int mthca_register_device(struct mthca_d if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; - dev->ib_dev.modify_srq = mthca_modify_srq; + dev->ib_dev.modify_srq = mthca_modify_srq; + dev->ib_dev.query_srq = mthca_query_srq; dev->ib_dev.destroy_srq = mthca_destroy_srq; if (mthca_is_memfree(dev)) @@ -1302,6 +1306,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.create_qp = mthca_create_qp; dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.query_qp = mthca_query_qp; dev->ib_dev.destroy_qp = mthca_destroy_qp; dev->ib_dev.create_cq = mthca_create_cq; dev->ib_dev.resize_cq = mthca_resize_cq; Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_qp.c @@ -270,6 +270,32 @@ static int to_mthca_state(enum ib_qp_sta } } + +static inline enum ib_qp_state to_ib_qp_state(int mthca_state) +{ + switch (mthca_state) { + case MTHCA_QP_STATE_RST: return IB_QPS_RESET; + case MTHCA_QP_STATE_INIT: return IB_QPS_INIT; + case MTHCA_QP_STATE_RTR: return IB_QPS_RTR; + case MTHCA_QP_STATE_RTS: return IB_QPS_RTS; + case MTHCA_QP_STATE_DRAINING: + case MTHCA_QP_STATE_SQD: 
return IB_QPS_SQD; + case MTHCA_QP_STATE_SQE: return IB_QPS_SQE; + case MTHCA_QP_STATE_ERR: return IB_QPS_ERR; + default: return -1; + } +} + +static inline enum ib_mig_state to_ib_mig_state(int mthca_mig_state) +{ + switch (mthca_mig_state) { + case 0: return IB_MIG_ARMED; + case 1: return IB_MIG_REARM; + case 3: return IB_MIG_MIGRATED; + default: BUG(); + } +} + enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; static int to_mthca_st(int transport) @@ -553,6 +579,44 @@ static __be32 get_hw_access_flags(struct return cpu_to_be32(hw_access_flags); } +static int to_ib_qp_access_flags(int mthca_flags) +{ + int ib_flags = 0; + + if (mthca_flags & MTHCA_QP_BIT_RRE) + ib_flags |= IB_ACCESS_REMOTE_READ; + + if (mthca_flags & MTHCA_QP_BIT_RWE) + ib_flags |= IB_ACCESS_REMOTE_WRITE; + + if (mthca_flags & MTHCA_QP_BIT_RAE) + ib_flags |= IB_ACCESS_REMOTE_ATOMIC; + + return ib_flags; +} + +static void to_ib_ah_attr(struct mthca_dev *dev, struct ib_ah_attr *ib_ah_attr, + struct mthca_qp_path *path) +{ + memset(ib_ah_attr, 0, sizeof *path); + ib_ah_attr->port_num = (be32_to_cpu(path->port_pkey) >> 24) & 0x3; + ib_ah_attr->dlid = be16_to_cpu(path->rlid); + ib_ah_attr->sl = be32_to_cpu(path->sl_tclass_flowlabel) >> 28; + ib_ah_attr->src_path_bits = path->g_mylmc & 0x7f; + ib_ah_attr->static_rate = path->static_rate & 0x7; + ib_ah_attr->ah_flags = (path->g_mylmc & (1 << 7)) ? 
IB_AH_GRH : 0; + if (ib_ah_attr->ah_flags) { + ib_ah_attr->grh.sgid_index = path->mgid_index & (dev->limits.gid_table_len - 1); + ib_ah_attr->grh.hop_limit = path->hop_limit; + ib_ah_attr->grh.traffic_class = + (be32_to_cpu(path->sl_tclass_flowlabel) >> 20) & 0xff; + ib_ah_attr->grh.flow_label = + be32_to_cpu(path->sl_tclass_flowlabel) & 0xfffff; + memcpy(ib_ah_attr->grh.dgid.raw, + path->rgid, sizeof ib_ah_attr->grh.dgid.raw); + } +} + static void mthca_path_set(struct ib_ah_attr *ah, struct mthca_qp_path *path) { path->g_mylmc = ah->src_path_bits & 0x7f; @@ -914,6 +978,79 @@ int mthca_modify_qp(struct ib_qp *ibqp, return err; } + + +int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + int err; + struct mthca_mailbox *mailbox; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *context; + int mthca_state; + u8 status; + + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mthca_QUERY_QP(dev, qp->qpn, 0, mailbox, &status); + if (err) + goto exit; + + qp_param = mailbox->buf; + context = &qp_param->context; + mthca_state = be32_to_cpu(context->flags) >> 28; + qp_attr->qp_state = to_ib_qp_state(mthca_state); + qp_attr->cur_qp_state = qp_attr->qp_state; + qp_attr->path_mtu = context->mtu_msgmax >> 5; + qp_attr->path_mig_state = + to_ib_mig_state((be32_to_cpu(context->flags) >> 11) & 0x3); + qp_attr->qkey = be32_to_cpu(context->qkey); + qp_attr->rq_psn = be32_to_cpu(context->rnr_nextrecvpsn) & 0xffffff; + qp_attr->sq_psn = be32_to_cpu(context->next_send_psn) & 0xffffff; + qp_attr->dest_qp_num = be32_to_cpu(context->remote_qpn) & 0xffffff; + qp_attr->qp_access_flags = + to_ib_qp_access_flags(be32_to_cpu(context->params2)); + qp_attr->cap.max_send_wr = qp->sq.max; + qp_attr->cap.max_recv_wr = qp->rq.max; + qp_attr->cap.max_send_sge = 
qp->sq.max_gs; + qp_attr->cap.max_recv_sge = qp->rq.max_gs; + qp_attr->cap.max_inline_data = qp->max_inline_data; + + to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path); + to_ib_ah_attr(dev, &qp_attr->alt_ah_attr, &context->alt_path); + + qp_attr->pkey_index = be32_to_cpu(context->pri_path.port_pkey) & 0x7f; + qp_attr->alt_pkey_index = be32_to_cpu(context->alt_path.port_pkey) & 0x7f; + + + /* qp_attr->en_sqd_async_notify + this field is only applicable in modify qp */ + qp_attr->sq_draining = mthca_state == MTHCA_QP_STATE_DRAINING; + + qp_attr->max_rd_atomic = 1 << ((be32_to_cpu(context->params1) >> 21) & 0x7); + + qp_attr->max_dest_rd_atomic = + 1 << ((be32_to_cpu(context->params2) >> 21) & 0x7); + qp_attr->min_rnr_timer = + (be32_to_cpu(context->rnr_nextrecvpsn) >> 24) & 0x1f; + qp_attr->port_num = qp_attr->ah_attr.port_num; + qp_attr->timeout = context->pri_path.ackto >> 3; + qp_attr->retry_cnt = (be32_to_cpu(context->params1) >> 16) & 0x7; + qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5; + qp_attr->alt_port_num = qp_attr->alt_ah_attr.port_num; + qp_attr->alt_timeout = context->alt_path.ackto >> 3; + qp_init_attr->cap = qp_attr->cap; + +exit: + mthca_free_mailbox(dev, mailbox); + return err; +} + + static int mthca_max_data_size(struct mthca_dev *dev, struct mthca_qp *qp, int desc_sz) { /* Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1554,6 +1554,15 @@ int mthca_SW2HW_SRQ(struct mthca_dev *de CMD_TIME_CLASS_A, status); } + +int mthca_QUERY_SRQ(struct mthca_dev *dev, u32 num, + struct mthca_mailbox *mailbox, u8 *status) +{ + + return mthca_cmd_box(dev, 0, mailbox->dma, num, 0, + CMD_QUERY_SRQ, CMD_TIME_CLASS_A, status); +} + int mthca_HW2SW_SRQ(struct mthca_dev *dev, struct mthca_mailbox *mailbox, int srq_num, u8 *status) { Index: 
openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -486,6 +486,7 @@ int mthca_alloc_srq(struct mthca_dev *de void mthca_free_srq(struct mthca_dev *dev, struct mthca_srq *srq); int mthca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, enum ib_srq_attr_mask attr_mask); +int mthca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr); void mthca_srq_event(struct mthca_dev *dev, u32 srqn, enum ib_event_type event_type); void mthca_free_srq_wqe(struct mthca_srq *srq, u32 wqe_addr); @@ -497,6 +498,10 @@ int mthca_arbel_post_srq_recv(struct ib_ void mthca_qp_event(struct mthca_dev *dev, u32 qpn, enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); + +int mthca_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int mthca_tavor_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr); int mthca_tavor_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, From jackm at mellanox.co.il Mon Feb 13 02:07:05 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 13 Feb 2006 12:07:05 +0200 Subject: [openib-general] [PATCH] mthca: implement query_ah for MADs and memfree Message-ID: <20060213100705.GA21434@mellanox.co.il> Implement query_ah in provider layer (except for av's which are in HCA memory) Needed for implementing RMPP duplicate session detection on sending side (extraction of DGID/DLID and GRH flag from address handle). 
Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-09 12:46:06.560017000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-13 11:44:35.552503000 +0200 @@ -191,6 +191,34 @@ int mthca_read_ah(struct mthca_dev *dev, return 0; } +int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ah_attr *ah_attr) +{ + /* Only implement for MAD and memfree ah for now. */ + if (ah->type == MTHCA_AH_ON_HCA) + return -ENOSYS; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = be16_to_cpu(ah->av->dlid); + ah_attr->sl = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + ah_attr->static_rate = ah->av->msg_sr & 0x7; + ah_attr->src_path_bits = ah->av->g_slid & 0x7F; + ah_attr->port_num = be32_to_cpu(ah->av->port_pd) >> 24; + ah_attr->ah_flags = mthca_ah_grh_present(ah) ? IB_AH_GRH : 0; + + if (ah_attr->ah_flags) { + ah_attr->grh.traffic_class = + be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20; + ah_attr->grh.flow_label = + be32_to_cpu(ah->av->sl_tclass_flowlabel) & 0xfffff; + ah_attr->grh.hop_limit = ah->av->hop_limit; + ah_attr->grh.sgid_index = ah->av->gid_index & + (dev->limits.gid_table_len - 1); + memcpy(ah_attr->grh.dgid.raw, ah->av->dgid, 16); + } + return 0; +} + int __devinit mthca_init_av_table(struct mthca_dev *dev) { int err; Index: src/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-09 12:46:06.574016000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-09 12:46:29.484351000 +0200 @@ -529,6 +529,8 @@ int mthca_create_ah(struct mthca_dev *de struct mthca_pd *pd, struct ib_ah_attr *ah_attr, struct mthca_ah *ah); +int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ah_attr *ah_attr); int mthca_destroy_ah(struct mthca_dev *dev, struct 
mthca_ah *ah); int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header); Index: src/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-09 12:46:06.668020000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-09 12:46:29.496350000 +0200 @@ -446,6 +446,11 @@ static int mthca_ah_destroy(struct ib_ah return 0; } +static int mthca_ah_query(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return mthca_query_ah(to_mdev(ah->device), to_mah(ah), ah_attr); +} + static struct ib_srq *mthca_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *init_attr, struct ib_udata *udata) @@ -1288,6 +1293,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.query_ah = mthca_ah_query; if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; From eli at mellanox.co.il Mon Feb 13 04:19:31 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 13 Feb 2006 14:19:31 +0200 Subject: [openib-general] [PATCH] mthca - command interface Message-ID: <1139833171.5814.81.camel@mtls03.yok.mtl.com> This patch implements posting commands by issuing posted writes in a manner similar to doorbells. The benefit of this method is that it does not require polling the go bit before posting the command and thus can lower CPU utilization. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -128,6 +128,13 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + /* post commands like doorbells instead of hcr */ + int can_post_doorbells; + int dbell_post; + void __iomem *dbell_map; + void __iomem *dbell_ptrs[8]; + u64 dbell_base; + u16 dbell_offsets[8]; }; struct mthca_limits { Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -188,19 +188,40 @@ static inline int go_bit(struct mthca_de swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) -{ - int err = 0; - mutex_lock(&dev->cmd.hcr_mutex); +static void mthca_cmd_post_dbell(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token) +{ + void __iomem **ptr = dev->cmd.dbell_ptrs; + + writel((__force u32) cpu_to_be32(in_param >> 32), ptr[0]); + writel((__force u32) cpu_to_be32(in_param & 0xfffffffful), ptr[1]); + writel((__force u32) cpu_to_be32(in_modifier), ptr[2]); + writel((__force u32) cpu_to_be32(out_param >> 32), ptr[3]); + writel((__force u32) cpu_to_be32(out_param & 0xfffffffful), ptr[4]); + writel((__force u32) cpu_to_be32(token << 16), ptr[5]); + writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), ptr[6]); + writel((__force u32) 0, ptr[7]); +} + +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 
op, + u16 token, + int event) +{ if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -210,10 +231,8 @@ static int mthca_cmd_post(struct mthca_d } } - if (go_bit(dev)) { - err = -EAGAIN; - goto out; - } + if (go_bit(dev)) + return -EAGAIN; /* * We use writel (instead of something like memcpy_toio) @@ -236,7 +255,29 @@ static int mthca_cmd_post(struct mthca_d (op_modifier << HCR_OPMOD_SHIFT) | op), dev->hcr + 6 * 4); -out: + return 0; +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + mutex_lock(&dev->cmd.hcr_mutex); + + if (dev->cmd.dbell_post && event) + mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier, + op_modifier, op, token); + else + err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier, + op_modifier, op, token, event); + mutex_unlock(&dev->cmd.hcr_mutex); return err; } @@ -461,6 +502,8 @@ void mthca_cmd_cleanup(struct mthca_dev { pci_pool_destroy(dev->cmd.pool); iounmap(dev->hcr); + if (dev->cmd.dbell_post) + iounmap(dev->cmd.dbell_map); } /* @@ -499,11 +542,54 @@ int mthca_cmd_use_events(struct mthca_de --dev->cmd.token_mask; dev->cmd.use_events = 1; + + down(&dev->cmd.poll_sem); return 0; } + +/* + * attempt to post commands through doorbells + */ +int mthca_use_cmd_doorbells(struct mthca_dev *dev) +{ + int i; + u16 max_off = 0; + unsigned long pg1, pg2; + void __iomem *map_base; + + if (!dev->cmd.can_post_doorbells) + return -ENODEV; + + for (i=0; i<8; ++i) + if (dev->cmd.dbell_offsets[i] > max_off) + max_off = dev->cmd.dbell_offsets[i]; + + pg1 = dev->cmd.dbell_base & PAGE_MASK; + pg2 = (dev->cmd.dbell_base + max_off) & PAGE_MASK; + + if (pg1 != pg2) + return -ENOMEM; + + map_base = ioremap(dev->cmd.dbell_base, max_off + + sizeof(unsigned long)); + if (!map_base) { + iounmap(map_base); + return -ENOMEM; + } + + for (i=0; i<8; ++i) + dev->cmd.dbell_ptrs[i] = map_base + + dev->cmd.dbell_offsets[i]; 
+ + dev->cmd.dbell_map = map_base; + dev->cmd.dbell_post = 1; + + return 0; +} + /* * Switch back to polling (used when shutting down the device) */ @@ -667,8 +753,10 @@ int mthca_QUERY_FW(struct mthca_dev *dev { struct mthca_mailbox *mailbox; u32 *outbox; + u32 tmp; int err = 0; u8 lg; + int i; #define QUERY_FW_OUT_SIZE 0x100 #define QUERY_FW_VER_OFFSET 0x00 @@ -676,6 +764,11 @@ int mthca_QUERY_FW(struct mthca_dev *dev #define QUERY_FW_ERR_START_OFFSET 0x30 #define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_CMD_DB_EN_OFFSET 0x10 +#define QUERY_FW_CMD_DB_OFFSET 0x50 +#define QUERY_FW_CMD_DB_BASE 0x60 + #define QUERY_FW_START_OFFSET 0x20 #define QUERY_FW_END_OFFSET 0x28 @@ -708,6 +801,12 @@ int mthca_QUERY_FW(struct mthca_dev *dev dev->cmd.max_cmds = 1 << lg; MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET); MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET); + MTHCA_GET(tmp, outbox, QUERY_FW_CMD_DB_EN_OFFSET); + dev->cmd.can_post_doorbells = tmp & 0x1; + MTHCA_GET(dev->cmd.dbell_base, outbox, QUERY_FW_CMD_DB_BASE); + for (i=0; i<8; ++i) + MTHCA_GET(dev->cmd.dbell_offsets[i], outbox, + QUERY_FW_CMD_DB_OFFSET + (i << 1)); mthca_dbg(dev, "FW version %012llx, max commands %d\n", (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); @@ -1608,7 +1707,6 @@ int mthca_MODIFY_QP(struct mthca_dev *de if (0) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" opt param mask: %08x\n", be32_to_cpup(mailbox->buf)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) printk(" [%02x] ", i * 4); Index: openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- openib_gen2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ openib_gen2/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -244,6 +244,7 @@ struct mthca_set_ib_param { int mthca_cmd_init(struct mthca_dev *dev); void mthca_cmd_cleanup(struct mthca_dev *dev); int mthca_cmd_use_events(struct mthca_dev *dev); +int 
mthca_use_cmd_doorbells(struct mthca_dev *dev); void mthca_cmd_use_polling(struct mthca_dev *dev); void mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); From mst at mellanox.co.il Mon Feb 13 04:56:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Feb 2006 14:56:26 +0200 Subject: [openib-general] Re: [PATCH] mthca - command interface In-Reply-To: <1139833171.5814.81.camel@mtls03.yok.mtl.com> References: <1139833171.5814.81.camel@mtls03.yok.mtl.com> Message-ID: <20060213125626.GA10389@mellanox.co.il> Quoting Eli Cohen : > This patch implements posting commands by issuing posted writes in a > manner similar to doorbells. The benefit of this method is that it does > not require polling the go bit before posting the command and thus can > lower CPU utilization. > > Signed-off-by: Eli Cohen > Signed-off-by: Michael S. Tsirkin No real problem here, but I don't think you need to add Signed-off-by: Michael S. Tsirkin to patches you post to openib yourself, just because I went over the patch coding-style-wise. Documentation/SubmittingPatches says: "The sign-off is a simple line at the end of the explanation for the patch, which certifies that you wrote it or otherwise have the right to pass it on as a open-source patch." -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From yael at mellanox.co.il Mon Feb 13 05:17:08 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 13 Feb 2006 15:17:08 +0200 Subject: [openib-general] [PATCH] Opensm - cosmetic change in osmtest.c Message-ID: <5zzmkvjt4b.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch removes an extra log message from osmtest.c Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 5380) +++ osmtest/osmtest.c (working copy) @@ -4362,7 +4362,6 @@ osmtest_validate_single_path_rec_guid_pa num_recs = context.result.result_cnt; osm_log( &p_osmt->log, OSM_LOG_VERBOSE, - "osmtest_validate_single_path_rec_guid_pair: " "osmtest_validate_single_path_rec_guid_pair: %u records\n", num_recs); From yael at mellanox.co.il Mon Feb 13 05:19:03 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 13 Feb 2006 15:19:03 +0200 Subject: [openib-general] [PATCH] Opensm - include changes for windows Message-ID: <5zy80fjt14.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch changes some of the includes for the windows compilation. 
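For readers unfamiliar with the pattern, the include changes described here amount to guarding platform-specific headers behind a build macro. The following is only a minimal sketch of that idea: `__WIN__` is the macro the patch tests, but the header names in the archived diff were lost, so the headers and the `demo_*` names below are assumptions chosen purely for illustration and are not the ones the patch touches.

```c
#include <assert.h>
#include <string.h>

/*
 * __WIN__ is the macro tested in the patch. The headers and function
 * names below are assumptions chosen only to show the pattern.
 */
#ifndef __WIN__
#include <strings.h>              /* POSIX home of strcasecmp() */
#define demo_casecmp strcasecmp
#else
#define demo_casecmp _stricmp     /* Microsoft CRT equivalent, from <string.h> */
#endif

/* Returns 1 when the two names match ignoring case, 0 otherwise. */
static int demo_names_equal(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	return demo_casecmp(a, b) == 0;
}
```

Keeping the platform test in one place and hiding it behind a single name means the rest of the file needs no further `#ifdef` blocks.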
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_log.h =================================================================== --- include/opensm/osm_log.h (revision 5380) +++ include/opensm/osm_log.h (working copy) @@ -51,7 +51,9 @@ #ifndef _OSM_LOG_H_ #define _OSM_LOG_H_ +#ifndef __WIN__ #include +#endif #include #include #include Index: osmtest/include/osmtest_base.h =================================================================== --- osmtest/include/osmtest_base.h (revision 5380) +++ osmtest/include/osmtest_base.h (working copy) @@ -46,8 +46,10 @@ #ifndef _OSMTEST_BASE_H_ #define _OSMTEST_BASE_H_ -#ifndef WIN32 +#ifndef __WIN__ #include +#else +#include #endif #define OSMTEST_MAX_LINE_LEN 120 From yael at mellanox.co.il Mon Feb 13 05:22:06 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 13 Feb 2006 15:22:06 +0200 Subject: [openib-general] [PATCH] Opensm - osmt_service.c - changes for windows Message-ID: <5zwtfzjsw1.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch changes some of the includes for the windows compilation. Also there are fixes of type casting. 
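As a rough illustration of why the record counters are widened from `uint16_t` to `uint32_t`, the sketch below shows how a 32-bit count is silently reduced modulo 65536 when stored in a 16-bit variable. It assumes the query layer reports its result count as a 32-bit value; the `demo_*` functions are hypothetical and are not part of osmtest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a query layer that reports a 32-bit count. */
static uint32_t demo_result_cnt(void)
{
	return 70000u;	/* more records than a uint16_t can hold */
}

/* Narrow storage: the value is reduced modulo 65536. */
static uint16_t demo_count_u16(void)
{
	return (uint16_t)demo_result_cnt();
}

/* Matching width: the count is preserved. */
static uint32_t demo_count_u32(void)
{
	return demo_result_cnt();
}

/* Self-check contrasting the two storage widths. */
static void demo_check(void)
{
	assert(demo_count_u32() == 70000u);
	assert(demo_count_u16() == 70000u % 65536u);
}
```

Stricter compilers warn about the implicit narrowing in the 16-bit case, which is presumably part of what the type changes in this patch address.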
Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmt_service.c =================================================================== --- osmtest/osmt_service.c (revision 5380) +++ osmtest/osmt_service.c (working copy) @@ -50,7 +50,11 @@ /* next error code: 16A */ +#ifndef __WIN__ #include +#else +#include +#endif #include #include #include @@ -487,7 +491,7 @@ osmt_get_service_by_id_and_name ( IN osm osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t svc_rec,*p_rec; - uint16_t num_recs = 0; + uint32_t num_recs = 0; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_id ); @@ -611,7 +615,7 @@ osmt_get_service_by_id ( IN osmtest_t * osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t svc_rec,*p_rec; - uint16_t num_recs = 0; + uint32_t num_recs = 0; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_id ); @@ -733,7 +737,7 @@ osmt_get_service_by_name_and_key ( IN os osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t svc_rec,*p_rec; - uint16_t num_recs = 0; + uint32_t num_recs = 0; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name_and_key ); @@ -865,7 +869,7 @@ osmt_get_service_by_name( IN osmtest_t * osmv_query_req_t req; ib_service_record_t* p_rec; ib_svc_name_t service_name; - uint16_t num_recs = 0; + uint32_t num_recs = 0; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name ); @@ -977,12 +981,12 @@ ib_api_status_t osmt_get_all_services_and_check_names( IN osmtest_t * const p_osmt, IN ib_svc_name_t * const p_valid_service_names_arr, IN uint8_t num_of_valid_names, - OUT uint16_t *num_services) { + OUT uint32_t *num_services) { ib_api_status_t status = IB_SUCCESS; osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t* p_rec; - uint16_t num_recs = 0,i,j; + uint32_t num_recs = 0,i,j; uint8_t *p_checked_names; OSM_LOG_ENTER(&p_osmt->log, osmt_get_all_services_and_check_names ); @@ -1071,8 +1075,8 @@ 
osmt_get_all_services_and_check_names( I "osmt_get_all_services_and_check_names: " "-I- Comparing source name : >%s<, with record name : >%s<, idx : %d\n", p_valid_service_names_arr[j],p_rec->service_name, p_checked_names[j]); - if ( strcmp((const char *)p_valid_service_names_arr[j], - (const char *)p_rec->service_name) == 0 ) + if ( strcmp(p_valid_service_names_arr[j], + p_rec->service_name) == 0 ) { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " @@ -1236,18 +1240,22 @@ osmt_run_service_records_flow( IN osmtes #ifdef VENDOR_RMPP_SUPPORT /* This array contain only the valid names after registering vs SM */ ib_svc_name_t service_valid_names[3]; - uint16_t num_recs = 0; + uint32_t num_recs = 0; #endif OSM_LOG_ENTER( &p_osmt->log, osmt_run_service_records_flow); /* Init Service names */ for (i=0 ; i<=6 ; i++ ) { - uint64_t rand = random()-(uint64_t)i; - id[i] = abs(pid - rand); +#ifdef __WIN__ + uint64_t rand_val = rand()-(uint64_t)i; +#else + uint64_t rand_val = random()-(uint64_t)i; +#endif + id[i] = abs((int)(pid - rand_val)); /* Just to be unique any place on any host */ sprintf((char*)(service_name[i]), - "osmt.srvc.%" PRIu64 ".%" PRIu64, rand,pid); + "osmt.srvc.%" PRIu64 ".%" PRIu64, rand_val,pid); /*printf("-I- Service Name is : %s, ID is : 0x%" PRIx64 "\n",service_name[i],id[i]);*/ } status = osmt_register_service( From yael at mellanox.co.il Mon Feb 13 05:24:01 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 13 Feb 2006 15:24:01 +0200 Subject: [openib-general] [PATCH] Opensm - osmt_slvl_vl_arb.c - changes for windows Message-ID: <5zvevjjssu.fsf@mtl066.yok.mtl.com> Hi Hal, Yet another patch with changes to some of the includes for the Windows compilation, and type-casting fixes. 
Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmt_slvl_vl_arb.c =================================================================== --- osmtest/osmt_slvl_vl_arb.c (revision 5380) +++ osmtest/osmt_slvl_vl_arb.c (working copy) @@ -46,7 +46,9 @@ /* next error code: 16A */ +#ifndef __WIN__ #include +#endif #include #include #include @@ -296,7 +298,8 @@ osmtest_write_slvl_map_table( IN osmtest for (i = 0; i<16; i++) fprintf( fh,"| %-2u ", i); fprintf( fh, "|\nVL:"); - for (i = 0; i<16; i++) fprintf( fh,"| 0x%01X ",ib_slvl_table_get( &p_rec->slvl_tbl, i)); + for (i = 0; i<16; i++) + fprintf( fh,"| 0x%01X ",ib_slvl_table_get( &p_rec->slvl_tbl, (uint8_t)i)); fprintf( fh,"|\nEND\n\n"); /* Exit: */ From ogerlitz at voltaire.com Mon Feb 13 06:56:17 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Feb 2006 16:56:17 +0200 (IST) Subject: [openib-general] [PATCH] iser: move from the iscsi connection fields directly related to the rdma conn Message-ID: moved from struct iscsi_iser_conn to iser_conn all the fields related to the rdma connection and simplified a flow related to its establishment/teardown Signed-off-by: Or Gerlitz Index: iscsi_iser.h =================================================================== --- iscsi_iser.h (revision 5382) +++ iscsi_iser.h (revision 5383) @@ -260,10 +260,12 @@ struct iser_conn struct ib_qp *qp; struct ib_fmr_pool *fmr_pool; int disc_evt_flag;/* may a state reflect it? */ - wait_queue_head_t connect_wait_q; + wait_queue_head_t wait; struct iscsi_iser_conn *p_iscsi_conn; - u32 dst_addr; - u16 dst_port; + atomic_t post_recv_buf_count; + atomic_t post_send_buf_count; + struct work_struct comperror_work; /* conn term sleepabl*/ + char name[ISER_OBJECT_NAME_SIZE]; }; struct iscsi_iser_queue { @@ -314,11 +316,6 @@ struct iscsi_iser_conn { struct iser_conn *ib_conn; /* iSER IB conn */ int ff_mode_enabled; /* To be removed ??? 
*/ - atomic_t post_recv_buf_count; - atomic_t post_send_buf_count; - wait_queue_head_t disconnect_wait_q; /* used by sync term */ - struct work_struct comperror_work; /* conn term sleepabl*/ - char name[ISER_OBJECT_NAME_SIZE]; }; struct iscsi_iser_session { @@ -503,6 +500,4 @@ void iser_unreg_mem(struct iser_mem_reg int iser_post_recv(struct iser_desc *p_rx_desc); int iser_start_send(struct iser_desc *p_tx_desc); - -void iser_comp_error_worker(void *data); #endif Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 5382) +++ iser_verbs.c (revision 5383) @@ -49,6 +49,7 @@ static void iser_cq_tasklet_fn(unsigned long data); static void iser_cq_callback(struct ib_cq *cq, void *cq_context); +static void iser_comp_error_worker(void *data); static void iser_cq_event_callback(struct ib_event *cause, void *context) { @@ -321,7 +322,7 @@ int iser_conn_sync_terminate(struct iscs if (err) iser_bug("Failed to disc.gracefully, conn: 0x%p\n", p_iser_conn); - wait_event_interruptible(p_iser_conn->disconnect_wait_q, + wait_event_interruptible(ib_conn->wait, (atomic_read(&ib_conn->state) == ISER_CONN_DOWN)); break; @@ -330,7 +331,7 @@ int iser_conn_sync_terminate(struct iscs * conn stop, in that case the state is aligned to SYNC */ case ISER_CONN_ASYNC_TERM: atomic_set(&ib_conn->state, ISER_CONN_SYNC_TERM); - wait_event_interruptible(p_iser_conn->disconnect_wait_q, + wait_event_interruptible(ib_conn->wait, (atomic_read(&ib_conn->state) == ISER_CONN_DOWN)); break; @@ -367,7 +368,7 @@ void iser_conn_async_terminate(struct is /* iser_complete_conn_termination - Checks if the conn may be terminated * * and terminates if possible */ -static void iser_complete_conn_termination(struct iscsi_iser_conn *p_iser_conn) +static void iser_complete_conn_termination(struct iser_conn *p_iser_conn) { int rcv_buf_count; int send_buf_count; @@ -377,16 +378,16 @@ static void iser_complete_conn_terminati /* Check if this conn may be 
terminated now */ if (rcv_buf_count == 0 && - send_buf_count == 0 && p_iser_conn->ib_conn->disc_evt_flag) { + send_buf_count == 0 && p_iser_conn->disc_evt_flag) { unsigned int cur_conn_state; - cur_conn_state = atomic_read(&p_iser_conn->ib_conn->state); + cur_conn_state = atomic_read(&p_iser_conn->state); if (cur_conn_state != ISER_CONN_ASYNC_TERM && cur_conn_state != ISER_CONN_SYNC_TERM) { iser_bug("Illegal state (%s)", - iser_conn_get_state_name(p_iser_conn->ib_conn)); + iser_conn_get_state_name(p_iser_conn)); } - atomic_set(&p_iser_conn->ib_conn->state, ISER_CONN_DOWN); + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); iser_dbg("Conn:0x%p, may be terminated:%s\n", p_iser_conn, (cur_conn_state == ISER_CONN_ASYNC_TERM ? @@ -394,19 +395,20 @@ static void iser_complete_conn_terminati /* Notify the upper layer about asynch terminations */ if (cur_conn_state == ISER_CONN_ASYNC_TERM) - iscsi_iser_conn_failure(p_iser_conn, + iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn, ISCSI_ERR_CONN_FAILED); if (cur_conn_state == ISER_CONN_SYNC_TERM) - wake_up_interruptible(&p_iser_conn->disconnect_wait_q); + wake_up_interruptible( + &p_iser_conn->wait); else - iser_conn_release(p_iser_conn->ib_conn); + iser_conn_release(p_iser_conn); } else { iser_dbg("Conn:0x%p not terminated now, disc_event:%d," " post_recv_cnt:%d, post_send_cnt:%d, state:%s)\n", - p_iser_conn, p_iser_conn->ib_conn->disc_evt_flag, + p_iser_conn, p_iser_conn->disc_evt_flag, rcv_buf_count, send_buf_count, - iser_conn_get_state_name(p_iser_conn->ib_conn)); + iser_conn_get_state_name(p_iser_conn)); } } @@ -417,7 +419,7 @@ static void iser_connect_error(struct rd if (atomic_read(&p_iser_conn->state) == ISER_CONN_PENDING) { atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); - wake_up_interruptible(&p_iser_conn->connect_wait_q); + wake_up_interruptible(&p_iser_conn->wait); } else iser_err("Unexpected evt for conn.state: %s\n", iser_conn_get_state_name(p_iser_conn)); @@ -480,7 +482,7 @@ static void 
iser_connected_handler(struc p_iser_conn = (struct iser_conn *)cma_id->context; atomic_set(&p_iser_conn->state, ISER_CONN_UP); - wake_up_interruptible(&p_iser_conn->connect_wait_q); + wake_up_interruptible(&p_iser_conn->wait); } static void iser_disconnected_handler(struct rdma_cm_id *cma_id) @@ -496,7 +498,7 @@ static void iser_disconnected_handler(st /* terminated asynchronously from the iSCSI layer's perspective. */ if (atomic_read(&p_iser_conn->state) == ISER_CONN_PENDING) { atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); - wake_up_interruptible(&p_iser_conn->connect_wait_q); + wake_up_interruptible(&p_iser_conn->wait); } else { if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) { atomic_set(&p_iser_conn->state, ISER_CONN_ASYNC_TERM); @@ -505,7 +507,7 @@ static void iser_disconnected_handler(st } /* Complete the termination process if possible */ /* (no events pending) */ - iser_complete_conn_termination(p_iser_conn->p_iscsi_conn); + iser_complete_conn_termination(p_iser_conn); } } @@ -553,7 +555,11 @@ void iser_conn_init(struct iser_conn *p_ { memset(p_iser_conn, 0, sizeof(struct iser_conn)); atomic_set(&p_iser_conn->state, ISER_CONN_INIT); - init_waitqueue_head(&p_iser_conn->connect_wait_q); + init_waitqueue_head(&p_iser_conn->wait); + atomic_set(&p_iser_conn->post_recv_buf_count, 0); + atomic_set(&p_iser_conn->post_send_buf_count, 0); + INIT_WORK(&p_iser_conn->comperror_work, iser_comp_error_worker, + p_iser_conn); } /** @@ -567,8 +573,8 @@ int iser_connect(struct iser_conn *p_i struct sockaddr *src, *dst; int err = 0; - p_iser_conn->dst_addr = dst_addr->sin_addr.s_addr; - p_iser_conn->dst_port = dst_addr->sin_port; + sprintf(p_iser_conn->name,"%d.%d.%d.%d:%d", + NIPQUAD(dst_addr->sin_addr.s_addr), dst_addr->sin_port); /* the adaptor is known only --after-- address resolution */ p_iser_conn->p_adaptor = NULL; @@ -596,7 +602,7 @@ int iser_connect(struct iser_conn *p_i goto connect_failure; } - wait_event_interruptible(p_iser_conn->connect_wait_q, + 
wait_event_interruptible(p_iser_conn->wait, atomic_read(&p_iser_conn->state) != ISER_CONN_PENDING); if (atomic_read(&p_iser_conn->state) != ISER_CONN_UP) { @@ -762,9 +768,11 @@ int iser_post_recv(struct iser_desc *p_r recv_wr.num_sge = p_recv_dto->regd_vector_len; recv_wr.wr_id = (unsigned long)p_rx_desc; + atomic_inc(&p_iser_conn->ib_conn->post_recv_buf_count); ib_ret = ib_post_recv (p_iser_conn->ib_conn->qp, &recv_wr, &recv_wr_failed); if (ib_ret) { iser_err("ib_post_recv failed ret=%d\n", ib_ret); + atomic_dec(&p_iser_conn->ib_conn->post_recv_buf_count); ret_val = -1; } @@ -800,37 +808,37 @@ int iser_start_send(struct iser_desc *p_ send_wr.opcode = IB_WR_SEND; send_wr.send_flags = p_dto->notify_enable ? IB_SEND_SIGNALED : 0; - atomic_inc(&p_iser_conn->post_send_buf_count); + atomic_inc(&p_iser_conn->ib_conn->post_send_buf_count); ib_ret = ib_post_send(p_iser_conn->ib_conn->qp, &send_wr, &send_wr_failed); if (ib_ret) { iser_err("Failed to start SEND DTO, p_dto: 0x%p, IOV len: %d\n", p_dto, p_dto->regd_vector_len); iser_err("ib_post_send failed, ret:%d\n", ib_ret); - atomic_dec(&p_iser_conn->post_send_buf_count); + atomic_dec(&p_iser_conn->ib_conn->post_send_buf_count); ret_val = -1; } return ret_val; } -void iser_comp_error_worker(void *data) +static void iser_comp_error_worker(void *data) { - struct iscsi_iser_conn *p_iser_conn = data; + struct iser_conn *p_iser_conn = data; if (p_iser_conn == NULL) iser_bug("NULL p_desc->p_conn \n"); - if (atomic_read(&p_iser_conn->ib_conn->state) == ISER_CONN_UP) - iser_conn_async_terminate(p_iser_conn->ib_conn); + if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) + iser_conn_async_terminate(p_iser_conn); iser_complete_conn_termination(p_iser_conn); } static void iser_handle_comp_error(struct iser_desc *p_desc) { - struct iser_dto *p_dto = &p_desc->dto; - struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; + struct iser_dto *p_dto = &p_desc->dto; + struct iser_conn *p_iser_conn = p_dto->p_conn->ib_conn; 
iser_dto_buffs_release(p_dto); @@ -857,6 +865,7 @@ static void iser_cq_tasklet_fn(unsigned struct iser_desc *p_desc; unsigned long xfer_len; + while (ib_poll_cq(cq, 1, &wc) == 1) { p_desc = (struct iser_desc *) (unsigned long) wc.wr_id; Index: iser_initiator.c =================================================================== --- iser_initiator.c (revision 5382) +++ iser_initiator.c (revision 5383) @@ -304,7 +304,6 @@ static int iser_post_receive_control(str iser_dto_add_regd_buff(p_recv_dto, p_regd_data, USE_NO_OFFSET, USE_ENTIRE_SIZE); - atomic_inc(&p_iser_conn->post_recv_buf_count); err = iser_post_recv(rx_desc); post_receive_control_exit: @@ -313,7 +312,6 @@ post_receive_control_exit: if (rx_desc->data != NULL) kfree(rx_desc->data); kmem_cache_free(ig.desc_cache, rx_desc); - atomic_dec(&p_iser_conn->post_recv_buf_count); } return err; } @@ -355,9 +353,9 @@ int iser_conn_set_full_featured_mode(str /* Check that there is no posted recv or send buffers left - */ /* they must be consumed during the login phase */ - if (atomic_read(&p_iser_conn->post_recv_buf_count) != 0) + if (atomic_read(&p_iser_conn->ib_conn->post_recv_buf_count) != 0) iser_bug("Number of currently posted recv bufs non-zero\n"); - if (atomic_read(&p_iser_conn->post_send_buf_count) != 0) + if (atomic_read(&p_iser_conn->ib_conn->post_send_buf_count) != 0) iser_bug("Number of currently posted send bufs non-zero\n"); /* Initial post receive buffers */ @@ -384,7 +382,8 @@ iser_check_xmit(struct iscsi_iser_conn int rc = 0; spin_lock_bh(&conn->lock); - if (atomic_read(&conn->post_send_buf_count) == ISER_QP_MAX_REQ_DTOS) { + if (atomic_read(&conn->ib_conn->post_send_buf_count) == + ISER_QP_MAX_REQ_DTOS) { iser_dbg("%ld can't xmit task %p, suspending tx\n",jiffies,task); set_bit(SUSPEND_BIT, &conn->suspend_tx); rc = -EAGAIN; @@ -462,7 +461,7 @@ int iser_send_command(struct iscsi_iser_ return 0; send_command_error: - if (p_send_dto != NULL) + if (p_send_dto != NULL) iser_dto_buffs_release(p_send_dto); if 
(p_iser_conn != NULL) { /* drop the conn, open tasks are deleted during shutdown */ @@ -684,7 +683,7 @@ void iser_rcv_completion(struct iser_des * task eliminates the need to worry on tasks which are completed in * * parallel to the execution of iser_conn_term. So the code that waits * * for the posted rx bufs refcount to become zero handles everything */ - atomic_dec(&p_iser_conn->post_recv_buf_count); + atomic_dec(&p_iser_conn->ib_conn->post_recv_buf_count); } void iser_snd_completion(struct iser_desc *p_tx_desc) @@ -699,7 +698,7 @@ void iser_snd_completion(struct iser_des if (p_tx_desc->type == ISCSI_TX_DATAOUT) kmem_cache_free(ig.desc_cache, p_tx_desc); - atomic_dec(&p_iser_conn->post_send_buf_count); + atomic_dec(&p_iser_conn->ib_conn->post_send_buf_count); spin_lock(&p_iser_conn->lock); if (p_iser_conn->suspend_tx) { Index: iscsi_iser.c =================================================================== --- iscsi_iser.c (revision 5382) +++ iscsi_iser.c (revision 5383) @@ -921,12 +921,7 @@ iscsi_iser_conn_create(iscsi_sessionh_t init_timer(&conn->tmabort_timer); init_MUTEX(&conn->xmitsema); init_waitqueue_head(&conn->ehwait); - - atomic_set(&conn->post_recv_buf_count, 0); - atomic_set(&conn->post_send_buf_count, 0); - init_waitqueue_head(&conn->disconnect_wait_q); spin_lock_init(&conn->lock); - INIT_WORK(&conn->comperror_work, iser_comp_error_worker, conn); return iscsi_handle(conn); @@ -1069,8 +1064,6 @@ iscsi_iser_conn_bind(iscsi_sessionh_t se p_iser_conn = iser_conn_from_sock(sock); p_iser_conn->p_iscsi_conn = conn; conn->ib_conn = p_iser_conn; - sprintf(conn->name,"%d.%d.%d.%d:%d", - NIPQUAD(p_iser_conn->dst_addr),p_iser_conn->dst_port); if (is_leading) session->leadconn = conn; From ogerlitz at voltaire.com Mon Feb 13 07:12:46 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Feb 2006 17:12:46 +0200 (IST) Subject: [openib-general] [PATCH] iser: simplified the rdma connection termination scheme Message-ID: simplified the rdma connection 
termination scheme, consolidated the flows for SYNC and ASYNC termination
into one.

Signed-off-by: Or Gerlitz

Index: iscsi_iser.h
===================================================================
--- iscsi_iser.h	(revision 5383)
+++ iscsi_iser.h	(revision 5384)
@@ -146,12 +146,11 @@ struct iser_hdr {
 #define ISER_OBJECT_NAME_SIZE 64

 enum iser_ib_conn_state {
-	ISER_CONN_INIT,		/* descriptor allocd, no conn */
-	ISER_CONN_PENDING,	/* about to be established */
-	ISER_CONN_UP,		/* up and running */
-	ISER_CONN_SYNC_TERM,	/* synchronous termination */
-	ISER_CONN_ASYNC_TERM,	/* asynchronous termination */
-	ISER_CONN_DOWN,		/* shut down */
+	ISER_CONN_INIT,		/* descriptor allocd, no conn */
+	ISER_CONN_PENDING,	/* in the process of being established */
+	ISER_CONN_UP,		/* up and running */
+	ISER_CONN_TERMINATING,	/* in the process of being terminated */
+	ISER_CONN_DOWN,		/* shut down */
 	ISER_CONN_STATES_NUM
 };

@@ -457,11 +456,7 @@ int iser_conn_establish(struct iser_con
 			struct sockaddr_in *dst_addr,
 			struct sockaddr_in *src_addr);

-void iser_conn_release(struct iser_conn *p_iser_conn);
-
-void iser_conn_async_terminate(struct iser_conn *p_iser_conn);
-
-int iser_conn_sync_terminate(struct iscsi_iser_conn *p_iser_conn);
+void iser_conn_terminate(struct iser_conn *ib_conn);

 void iser_rcv_completion(struct iser_desc *p_desc,
 			 unsigned long dto_xfer_len);
Index: iser_verbs.c
===================================================================
--- iser_verbs.c	(revision 5383)
+++ iser_verbs.c	(revision 5384)
@@ -50,6 +50,7 @@
 static void iser_cq_tasklet_fn(unsigned long data);
 static void iser_cq_callback(struct ib_cq *cq, void *cq_context);
 static void iser_comp_error_worker(void *data);
+static void iser_conn_release(struct iser_conn *p_iser_conn);

 static void iser_cq_event_callback(struct ib_event *cause, void *context)
 {
@@ -282,134 +283,21 @@ static void iser_adaptor_try_release(str
 	up(&ig.adaptor_list_sem);
 }

-static char *conn_state_name[ISER_CONN_STATES_NUM + 1] = {
-	"INITIAL ",
-	"PENDING ",
-	"UP ",
-	"SYNC_TERM ",
-	"ASYNC_TERM ",
-	"DOWN ",
-
-	"ILLEGAL "
-};
-
 /**
- * iser_conn_get_state_name - Retrieves symbolic name of a conn state
- *
- * returns conn name string
+ * iser_conn_terminate - Triggers start of the disconnect procedures and wait
+ * for them to be done
  */
-static char *iser_conn_get_state_name(struct iser_conn *p_iser_conn)
+void iser_conn_terminate(struct iser_conn *ib_conn)
 {
-	enum iser_ib_conn_state state;
-
-	state = atomic_read(&p_iser_conn->state);
-	return conn_state_name[state < ISER_CONN_STATES_NUM ?
-			       state : ISER_CONN_STATES_NUM];
-}
-
-/**
- * iser_conn_sync_terminate - Triggers start of the disconnect procedures
- */
-int iser_conn_sync_terminate(struct iscsi_iser_conn *p_iser_conn)
-{
-	struct iser_conn *ib_conn = p_iser_conn->ib_conn;
 	int err = 0;

-	switch (atomic_read(&ib_conn->state)) {
-	case ISER_CONN_UP:
-		atomic_set(&ib_conn->state, ISER_CONN_SYNC_TERM);
-		err = rdma_disconnect(ib_conn->cma_id);
-		if (err)
-			iser_bug("Failed to disc.gracefully, conn: 0x%p\n",
-				 p_iser_conn);
-		wait_event_interruptible(ib_conn->wait,
-					 (atomic_read(&ib_conn->state) ==
-					  ISER_CONN_DOWN));
-		break;
-
-	/* this state is possible here if async termination races with iscsi *
-	 * conn stop, in that case the state is aligned to SYNC */
-	case ISER_CONN_ASYNC_TERM:
-		atomic_set(&ib_conn->state, ISER_CONN_SYNC_TERM);
-		wait_event_interruptible(ib_conn->wait,
-					 (atomic_read(&ib_conn->state) ==
-					  ISER_CONN_DOWN));
-		break;
-
-	case ISER_CONN_DOWN:
-		/* this may happen only when iSCSI is being notified */
-		break;
-
-	default:
-		iser_err("called when in state %s\n",
-			 iser_conn_get_state_name(ib_conn));
-		err = -EPERM;
-		break;
-	}
-	return err;
-}
-
-/**
-* iser_conn_async_terminate - Triggers start of the disconn procedures
-*/
-void iser_conn_async_terminate(struct iser_conn *p_iser_conn)
-{
-	/* if the state is UP it means that the conn is being async terminated *
-	 * as of the iSCSI layer. We need to initiate a disconnection after *
-	 * which we will notify the iSCSI layer */
-	if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) {
-		iser_err("Conn. 0x%p is being terminated asynchronously\n", p_iser_conn);
-		atomic_set(&p_iser_conn->state, ISER_CONN_ASYNC_TERM);
-		rdma_disconnect(p_iser_conn->cma_id);
-	} else
-		iser_err("called when in state %s - doing nothing\n",
-			 iser_conn_get_state_name(p_iser_conn));
-}
-
-/* iser_complete_conn_termination - Checks if the conn may be terminated *
- * and terminates if possible */
-static void iser_complete_conn_termination(struct iser_conn *p_iser_conn)
-{
-	int rcv_buf_count;
-	int send_buf_count;
-
-	rcv_buf_count = atomic_read(&p_iser_conn->post_recv_buf_count);
-	send_buf_count = atomic_read(&p_iser_conn->post_send_buf_count);
-
-	/* Check if this conn may be terminated now */
-	if (rcv_buf_count == 0 &&
-	    send_buf_count == 0 && p_iser_conn->disc_evt_flag) {
-		unsigned int cur_conn_state;
-
-		cur_conn_state = atomic_read(&p_iser_conn->state);
-		if (cur_conn_state != ISER_CONN_ASYNC_TERM &&
-		    cur_conn_state != ISER_CONN_SYNC_TERM) {
-			iser_bug("Illegal state (%s)",
-				 iser_conn_get_state_name(p_iser_conn));
-		}
-		atomic_set(&p_iser_conn->state, ISER_CONN_DOWN);
-
-		iser_dbg("Conn:0x%p, may be terminated:%s\n", p_iser_conn,
-			 (cur_conn_state == ISER_CONN_ASYNC_TERM ?
-			  "ASYNC" : "SYNC"));
-
-		/* Notify the upper layer about asynch terminations */
-		if (cur_conn_state == ISER_CONN_ASYNC_TERM)
-			iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn,
-						ISCSI_ERR_CONN_FAILED);
-
-		if (cur_conn_state == ISER_CONN_SYNC_TERM)
-			wake_up_interruptible(
-				&p_iser_conn->wait);
-		else
-			iser_conn_release(p_iser_conn);
-	} else {
-		iser_dbg("Conn:0x%p not terminated now, disc_event:%d,"
-			 " post_recv_cnt:%d, post_send_cnt:%d, state:%s)\n",
-			 p_iser_conn, p_iser_conn->disc_evt_flag,
-			 rcv_buf_count, send_buf_count,
-			 iser_conn_get_state_name(p_iser_conn));
-	}
+	atomic_set(&ib_conn->state, ISER_CONN_TERMINATING);
+	err = rdma_disconnect(ib_conn->cma_id);
+	if (err)
+		iser_bug("Failed to disconnect, conn: 0x%p err %d\n",ib_conn,err);
+	wait_event_interruptible(ib_conn->wait,
+		(atomic_read(&ib_conn->state) == ISER_CONN_DOWN));
+	iser_conn_release(ib_conn);
 }

 static void iser_connect_error(struct rdma_cm_id *cma_id)
@@ -421,8 +309,8 @@ static void iser_connect_error(struct rd
 		atomic_set(&p_iser_conn->state, ISER_CONN_DOWN);
 		wake_up_interruptible(&p_iser_conn->wait);
 	} else
-		iser_err("Unexpected evt for conn.state: %s\n",
-			 iser_conn_get_state_name(p_iser_conn));
+		iser_err("Unexpected evt for conn.state: %d\n",
+			 atomic_read(&p_iser_conn->state));
 }

 static void iser_addr_handler(struct rdma_cm_id *cma_id)
@@ -489,8 +377,6 @@ static void iser_disconnected_handler(st
 {
 	struct iser_conn *p_iser_conn;

-	rdma_disconnect(cma_id);
-
 	p_iser_conn = (struct iser_conn *)cma_id->context;
 	p_iser_conn->disc_evt_flag = 1;

@@ -501,13 +387,16 @@ static void iser_disconnected_handler(st
 		wake_up_interruptible(&p_iser_conn->wait);
 	} else {
 		if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) {
-			atomic_set(&p_iser_conn->state, ISER_CONN_ASYNC_TERM);
-			iser_dbg("Conn 0x%p is being terminated"
-				 " asynchronously\n", p_iser_conn);
+			atomic_set(&p_iser_conn->state, ISER_CONN_TERMINATING);
+			iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn,
+						ISCSI_ERR_CONN_FAILED);
+		}
+		/* Complete the termination process if no posts are pending */
+		if ((atomic_read(&p_iser_conn->post_recv_buf_count) == 0) &&
+		    (atomic_read(&p_iser_conn->post_send_buf_count) == 0)) {
+			atomic_set(&p_iser_conn->state, ISER_CONN_DOWN);
+			wake_up_interruptible(&p_iser_conn->wait);
 		}
-		/* Complete the termination process if possible */
-		/* (no events pending) */
-		iser_complete_conn_termination(p_iser_conn);
 	}
 }

@@ -621,7 +510,7 @@ connect_failure:
 /**
  * Frees all conn objects and deallocs conn descriptor
  */
-void iser_conn_release(struct iser_conn *p_iser_conn)
+static void iser_conn_release(struct iser_conn *p_iser_conn)
 {
 	struct iscsi_iser_conn *p_iscsi_conn;
 	struct iser_adaptor *p_iser_adaptor = p_iser_conn->p_adaptor;
@@ -829,10 +718,18 @@ static void iser_comp_error_worker(void
 	if (p_iser_conn == NULL)
 		iser_bug("NULL p_desc->p_conn \n");

-	if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP)
-		iser_conn_async_terminate(p_iser_conn);
+	if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) {
+		atomic_set(&p_iser_conn->state, ISER_CONN_TERMINATING);
+		iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn,
+					ISCSI_ERR_CONN_FAILED);
+	}

-	iser_complete_conn_termination(p_iser_conn);
+	/* complete the termination process if disconnect event was delivered *
+	 * note there are no more non completed posts to the QP */
+	if (p_iser_conn->disc_evt_flag) {
+		atomic_set(&p_iser_conn->state, ISER_CONN_DOWN);
+		wake_up_interruptible(&p_iser_conn->wait);
+	}
 }

 static void iser_handle_comp_error(struct iser_desc *p_desc)
Index: iser_initiator.c
===================================================================
--- iser_initiator.c	(revision 5383)
+++ iser_initiator.c	(revision 5384)
@@ -463,11 +463,7 @@ int iser_send_command(struct iscsi_iser_
 send_command_error:
 	if (p_send_dto != NULL)
 		iser_dto_buffs_release(p_send_dto);
-	if (p_iser_conn != NULL) {
-		/* drop the conn, open tasks are deleted during shutdown */
-		iser_err("send cmd failed, drop conn:0x%p\n", p_iser_conn);
-		iser_conn_async_terminate(p_iser_conn->ib_conn);
-	}
+	iser_err("conn %p failed err %d\n",p_iser_conn, err);
 	return err;
 }

@@ -545,12 +541,7 @@ send_data_out_error:
 		iser_dto_buffs_release(p_send_dto);
 	if (tx_desc != NULL)
 		kmem_cache_free(ig.desc_cache, tx_desc);
-
-	if (p_iser_conn != NULL) {
-		/* drop the conn, open tasks are deleted during shutdown */
-		iser_err("send dout failed, drop conn:0x%p\n", p_iser_conn);
-		iser_conn_async_terminate(p_iser_conn->ib_conn);
-	}
+	iser_err("conn %p failed err %d\n",p_iser_conn, err);
 	return err;
 }

@@ -613,11 +604,7 @@ int iser_send_control(struct iscsi_iser_
 send_control_error:
 	if (p_send_dto != NULL)
 		iser_dto_buffs_release(p_send_dto);
-	if (p_iser_conn != NULL) {
-		/* drop the conn, open tasks are deleted during shutdown */
-		iser_err("send ctrl failed, drop conn:0x%p\n", p_iser_conn);
-		iser_conn_async_terminate(p_iser_conn->ib_conn);
-	}
+	iser_err("conn %p failed err %d\n",p_iser_conn, err);
 	return err;
 }

Index: iscsi_iser.c
===================================================================
--- iscsi_iser.c	(revision 5383)
+++ iscsi_iser.c	(revision 5384)
@@ -524,6 +524,9 @@ iscsi_iser_mtask_xmit(struct iscsi_iser_

 	error = iser_send_control(conn, mtask);

+	if (error && error != -EAGAIN)
+		iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+
 	return error;
 }

@@ -1181,13 +1184,7 @@ iscsi_iser_conn_stop(iscsi_connh_t connh

 	/* starts conn teardown process, waits until all previously *
 	 * posted buffers get flushed, deallocates all conn resources */
-	if (atomic_read(&conn->ib_conn->state) == ISER_CONN_UP) {
-		iser_conn_sync_terminate(conn);
-		iser_conn_release(conn->ib_conn);
-	}
-	else
-		iser_err("conn state is %d doing nothing\n",
-			 atomic_read(&conn->ib_conn->state));
+	iser_conn_terminate(conn->ib_conn);

 	sock_release(conn->sock);
 	conn->sock = NULL;

From eitan at mellanox.co.il  Mon Feb 13 07:23:20 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 13 Feb 2006 17:23:20 +0200
Subject: [openib-general] IPoIB and
 lid change
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B754@mtlexch01.mtl.com>

Hi,

I had a long discussion today with Michael, Yael and Tziporet regarding
this issue. We reached the following conclusions and proposal:

1. As we use only GID[0] (which cannot change) and a QP that is reserved
   for the interface even when it is down, we actually "never" change the
   IPoIB MAC (unless you download the module). So an unsolicited ARP reply
   is not a must. The MAC (which is GID,QP in IPoIB) is kept even if the
   LID changes.

2. When a new SM is brought up, it can optionally send
   ClientReRegistration, which should cause the entire UD AV cache (and
   the new SA Path Cache) to be flushed.

3. When subnets are merged there is currently no way to tell whether
   there were any LID changes. The SM does not need to do any
   "Set(PortInfo)" on a node if no change is required, so the remote node
   (that did not change LID) will not know about any such change.

There are several solutions for this problem. The one we believe should
be promoted is the concept of an "UnPath" Notice:

a. In IB any class manager (SM) can generate Traps carrying a Notice
   attribute. We propose a new Notice with trap number = YYY, which will
   mean an UnPath message. The notice will carry several fields of the
   path record in its DataDetails field, together with a component mask
   (the DataDetails field is 432 bits long). So the following fields are
   to be included in the DataDetails: SLID, DLID, P_KEY, TCLASS, SL,
   compMask. The component mask allows the field values to be wildcarded
   such that:
   * an UnPath Notice with a component mask = 0 will UnPath all paths.
   * an UnPath Notice with a component mask = 1 will UnPath all paths
     from the LID equal to the SLID included in the Notice.

b. IPoIB will have to register with the SA to receive these notices.

c. The SM needs to be smart about the way the notices are built: it
   should be able to coalesce multiple change events into one notice by
   using the wildcard (zero compMask) as appropriate for the case
   inspected by the SM. With this kind of coalescing the SM does not need
   to generate O(N^2) Reports but only O(N) Reports. As you might guess,
   any discovery and setting of the fabric requires O(N^2/32) just for
   the LFT settings, so our problem of distributing these N reports is
   not so big.

My plan is to bring this UnPath Notice to the IBTA MgtWG for discussion.

Thanks

Eitan

From halr at voltaire.com  Mon Feb 13 07:12:52 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 10:12:52 -0500
Subject: [openib-general] Re: [PATCH] Opensm - cosmetic change in osmtest.c
In-Reply-To: <5zzmkvjt4b.fsf@mtl066.yok.mtl.com>
References: <5zzmkvjt4b.fsf@mtl066.yok.mtl.com>
Message-ID: <1139843571.4475.8670.camel@hal.voltaire.com>

On Mon, 2006-02-13 at 08:17, Yael Kalka wrote:
> Hi Hal,
>
> The following patch removes an extra log message from osmtest.c
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Thanks. Applied.

From rdreier at cisco.com  Mon Feb 13 07:26:17 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Feb 2006 07:26:17 -0800
Subject: [openib-general] [PATCH 1/4] [RFC] Add ib_modify_qp_is_ok to core
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD5AD@mtlexch01.mtl.com>
Message-ID:

    Dotan> What about common code that check the validity of the
    Dotan> attributes of the modify QP (for example pkey_idx)?

Consolidating common code is always good. I will always like a patch
that moves code that appears in multiple places into a core function.
Usually it's better to put it in a library function so that low-level
drivers can override it if necessary.

 - R.
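
Returning to the UnPath Notice proposal earlier in this digest: the
component-mask wildcarding it describes could be sketched roughly as
below. This is only an illustration under stated assumptions, not code
from any posted patch. The proposal fixes just two cases (mask 0 matches
every path, mask 1 selects on SLID); the struct names, the helper
unpath_matches() and every bit position other than bit 0 are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical component-mask bits. Only "mask == 0 matches all paths"
 * and "mask == 1 selects on SLID" come from the proposal; the other bit
 * positions are assumptions made for this sketch. */
enum {
	UNPATH_COMP_SLID   = 1u << 0,
	UNPATH_COMP_DLID   = 1u << 1,
	UNPATH_COMP_PKEY   = 1u << 2,
	UNPATH_COMP_TCLASS = 1u << 3,
	UNPATH_COMP_SL     = 1u << 4,
};

/* The path-record fields the notice would carry in DataDetails. */
struct unpath_fields {
	uint16_t slid;
	uint16_t dlid;
	uint16_t pkey;
	uint8_t  tclass;
	uint8_t  sl;
};

struct unpath_notice {
	struct unpath_fields f;
	uint32_t comp_mask;	/* a cleared bit means "wildcard this field" */
};

/* Return true if a cached path should be dropped for this notice.
 * A field is compared only when its mask bit is set, so a zero mask
 * matches every cached path. */
static bool unpath_matches(const struct unpath_notice *n,
			   const struct unpath_fields *path)
{
	if ((n->comp_mask & UNPATH_COMP_SLID) && n->f.slid != path->slid)
		return false;
	if ((n->comp_mask & UNPATH_COMP_DLID) && n->f.dlid != path->dlid)
		return false;
	if ((n->comp_mask & UNPATH_COMP_PKEY) && n->f.pkey != path->pkey)
		return false;
	if ((n->comp_mask & UNPATH_COMP_TCLASS) && n->f.tclass != path->tclass)
		return false;
	if ((n->comp_mask & UNPATH_COMP_SL) && n->f.sl != path->sl)
		return false;
	return true;
}
```

A consumer such as IPoIB would presumably walk its path cache on each
received notice and flush every entry for which unpath_matches() returns
true, which is what lets the SM cover many changes with one wildcarded
notice.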
From halr at voltaire.com  Mon Feb 13 07:25:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 10:25:12 -0500
Subject: [openib-general] Re: [PATCH] Opensm - include changes for windows
In-Reply-To: <5zy80fjt14.fsf@mtl066.yok.mtl.com>
References: <5zy80fjt14.fsf@mtl066.yok.mtl.com>
Message-ID: <1139844311.4475.8782.camel@hal.voltaire.com>

On Mon, 2006-02-13 at 08:19, Yael Kalka wrote:
> Hi Hal,
>
> The following patch changes some of the includes for the windows compilation.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Thanks. Applied.

From mst at mellanox.co.il  Mon Feb 13 07:41:14 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 13 Feb 2006 17:41:14 +0200
Subject: [openib-general] madvise MADV_DONTFORK/MADV_DOFORK
Message-ID: <20060213154114.GO32041@mellanox.co.il>

OK, I guess it's time to start the push for merging this patch.
Probably not 2.6.16 material, but it would be nice to get this, say, into
-mm to make it easier to test.

Tested on x86_64 only. Please Cc me directly with comments, I'm not on
the list.

---

Add madvise options to control whether memory range is inherited across
fork. Useful e.g. for when hardware is doing DMA from/into these pages.

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.16-rc2/kernel/fork.c
===================================================================
--- linux-2.6.16-rc2.orig/kernel/fork.c	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/kernel/fork.c	2006-02-12 20:48:37.000000000 +0200
@@ -210,7 +210,7 @@ static inline int dup_mmap(struct mm_str
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;

-		if (mpnt->vm_flags & VM_DONTCOPY) {
+		if (mpnt->vm_flags & (VM_DONTCOPY | VM_DONTFORK)) {
 			long pages = vma_pages(mpnt);
 			mm->total_vm -= pages;
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
Index: linux-2.6.16-rc2/mm/mmap.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mmap.c	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/mm/mmap.c	2006-02-12 20:48:37.000000000 +0200
@@ -847,7 +847,7 @@ void vm_stat_account(struct mm_struct *m

 #ifdef CONFIG_HUGETLB
 	if (flags & VM_HUGETLB) {
-		if (!(flags & VM_DONTCOPY))
+		if (!(flags & (VM_DONTCOPY|VM_DONTFORK)))
 			mm->shared_vm += pages;
 		return;
 	}
Index: linux-2.6.16-rc2/mm/madvise.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/madvise.c	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/mm/madvise.c	2006-02-12 20:48:37.000000000 +0200
@@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a
 	struct mm_struct * mm = vma->vm_mm;
 	int error = 0;
 	pgoff_t pgoff;
-	int new_flags = vma->vm_flags & ~VM_READHINTMASK;
+	int new_flags = vma->vm_flags;

 	switch (behavior) {
+	case MADV_NORMAL:
+		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+		break;
 	case MADV_SEQUENTIAL:
-		new_flags |= VM_SEQ_READ;
+		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
 		break;
 	case MADV_RANDOM:
-		new_flags |= VM_RAND_READ;
+		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
 		break;
-	default:
+	case MADV_DONTFORK:
+		new_flags |= VM_DONTFORK;
+		break;
+	case MADV_DOFORK:
+		new_flags &= ~VM_DONTFORK;
 		break;
 	}

@@ -180,6 +187,8 @@ madvise_vma(struct vm_area_struct *vma,
 	case MADV_NORMAL:
 	case MADV_SEQUENTIAL:
 	case MADV_RANDOM:
+	case MADV_DONTFORK:
+	case MADV_DOFORK:
 		error = madvise_behavior(vma, prev, start, end, behavior);
 		break;
 	case MADV_REMOVE:
Index: linux-2.6.16-rc2/include/linux/mm.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/mm.h	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/include/linux/mm.h	2006-02-12 20:52:11.000000000 +0200
@@ -166,6 +166,7 @@ extern unsigned int kobjsize(const void
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_DONTFORK	0x04000000	/* App wants to avoid inheriting the vma on fork */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-x86_64/mman.h	2006-02-12 20:52:21.000000000 +0200
@@ -37,6 +37,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-powerpc/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-powerpc/mman.h	2006-02-12 20:55:07.000000000 +0200
@@ -45,6 +45,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-cris/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-cris/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-cris/mman.h	2006-02-12 20:54:31.000000000 +0200
@@ -38,6 +38,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-arm26/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-arm26/mman.h	2006-02-12 20:54:27.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-alpha/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-alpha/mman.h	2006-02-12 20:55:44.000000000 +0200
@@ -43,6 +43,8 @@
 #define MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
 #define MADV_REMOVE	7		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-m68k/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-m68k/mman.h	2006-02-12 20:54:54.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h	2006-02-10 03:43:19.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-xtensa/mman.h	2006-02-12 20:55:38.000000000 +0200
@@ -73,6 +73,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-mips/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-mips/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-mips/mman.h	2006-02-12 20:55:00.000000000 +0200
@@ -66,6 +66,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h	2006-02-10 03:43:18.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-sparc64/mman.h	2006-02-12 20:55:24.000000000 +0200
@@ -55,6 +55,8 @@
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_FREE	0x5		/* (Solaris) contents can be freed */
 #define MADV_REMOVE	0x6		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-v850/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-v850/mman.h	2006-02-10 03:43:18.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-v850/mman.h	2006-02-12 20:55:31.000000000 +0200
@@ -33,6 +33,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-s390/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-s390/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-s390/mman.h	2006-02-12 20:55:11.000000000 +0200
@@ -44,6 +44,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-parisc/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-parisc/mman.h	2006-02-12 20:55:04.000000000 +0200
@@ -49,6 +49,8 @@
 #define MADV_4M_PAGES	22		/* Use 4 Megabyte pages */
 #define MADV_16M_PAGES	24		/* Use 16 Megabyte pages */
 #define MADV_64M_PAGES	26		/* Use 64 Megabyte pages */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-i386/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-i386/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-i386/mman.h	2006-02-12 20:54:43.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-sh/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-sh/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-sh/mman.h	2006-02-12 20:55:14.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-ia64/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-ia64/mman.h	2006-02-12 20:54:47.000000000 +0200
@@ -44,6 +44,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-sparc/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-sparc/mman.h	2006-02-12 20:55:20.000000000 +0200
@@ -55,6 +55,8 @@
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_FREE	0x5		/* (Solaris) contents can be freed */
 #define MADV_REMOVE	0x6		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-m32r/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-m32r/mman.h	2006-02-12 20:54:51.000000000 +0200
@@ -38,6 +38,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-frv/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-frv/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-frv/mman.h	2006-02-12 20:54:35.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-h8300/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-h8300/mman.h	2006-02-12 20:54:39.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
Index: linux-2.6.16-rc2/include/asm-arm/mman.h
===================================================================
--- linux-2.6.16-rc2.orig/include/asm-arm/mman.h	2006-02-10 03:43:13.000000000 +0200
+++ linux-2.6.16-rc2/include/asm-arm/mman.h	2006-02-12 20:54:19.000000000 +0200
@@ -36,6 +36,8 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_REMOVE	0x5		/* remove these pages & resources */
+#define MADV_DONTFORK	0x30		/* dont inherit across fork */
+#define MADV_DOFORK	0x31		/* do inherit across fork */

 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From halr at voltaire.com  Mon Feb 13 07:36:33 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 10:36:33 -0500
Subject: [openib-general] Re: [PATCH] Opensm - osmt_service.c - changes for windows
In-Reply-To: <5zwtfzjsw1.fsf@mtl066.yok.mtl.com>
References: <5zwtfzjsw1.fsf@mtl066.yok.mtl.com>
Message-ID: <1139844992.4475.8874.camel@hal.voltaire.com>

On Mon, 2006-02-13 at 08:22, Yael Kalka wrote:
> Hi Hal,
>
> The following patch changes some of the includes for the windows
> compilation.
> Also there are fixes of type casting.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Thanks. Applied.

From halr at voltaire.com  Mon Feb 13 07:45:56 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 10:45:56 -0500
Subject: [openib-general] Re: [PATCH] Opensm - osmt_slvl_vl_arb.c - changes for windows
In-Reply-To: <5zvevjjssu.fsf@mtl066.yok.mtl.com>
References: <5zvevjjssu.fsf@mtl066.yok.mtl.com>
Message-ID: <1139845556.4475.8957.camel@hal.voltaire.com>

On Mon, 2006-02-13 at 08:24, Yael Kalka wrote:
> Hi Hal,
>
> Yet another patch with changes of the some includes for the windows
> compilation, and type casting fixes.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Thanks. Applied.

From Arkady.Kanevsky at netapp.com  Mon Feb 13 08:09:39 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Mon, 13 Feb 2006 11:09:39 -0500
Subject: [openib-general] RE: [swg] IP Addressing Annex v4.
Message-ID:

Updated annex that incorporates responses to Ted's comments.

Arkady Kanevsky                email: arkady at netapp.com
Network Appliance Inc.         phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
Waltham, MA 02451              central phone: 781-768-5300

> -----Original Message-----
> From: Kanevsky, Arkady
> Sent: Monday, February 13, 2006 11:04 AM
> To: Ted H. Kim
> Cc: swg at infinibandta.org
> Subject: RE: [swg] IP Addressing Annex v4.
>
> comments in-line.
>
> Arkady Kanevsky                email: arkady at netapp.com
> Network Appliance Inc.         phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
> Waltham, MA 02451              central phone: 781-768-5300
>
>
> > -----Original Message-----
> > From: Ted H. Kim [mailto:ted.kim at sun.com]
> > Sent: Wednesday, February 01, 2006 1:11 PM
> > To: Kanevsky, Arkady
> > Cc: swg at infinibandta.org
> > Subject: Re: [swg] IP Addressing Annex v4.
> >
> > Arkady,
> >
> > Some comments on v4 -
> >
> > p.1, line 28 - suggest removing the word "Agent" - since that word
> > has a particular meaning in the context of the IBTA management model,
> > which probably should not be referred to here
>
> OK. Removed from all 3 places.
> But CM stands for both protocol definition as well as "provider".
> I suggest to spell out Communication Manager(s) instead of CM.
> CM will be used for Connection Management (protocol).
>
> >
> > p.2, lines 17-20 and throughout the annex - while you use the
> > requestor/responder terminology which is used in the LWG chapters a
> > lot, I think it might be better to use active/passive to make it
> > consistent with the CM chapter or do you mean something different
> > here?
>
> active/passive refers to model not communicating entities.
> IBTA chapter 12 uses client/server for the entities.
> I will change requestor/responder to client/server.
>
> >
> > since the annex is short can you set it so tables are not broken
> > across pages?
>
> When I change Table attribute to start at the top of the page
> the requirement before it (2 lines) take the whole page alone.
> I will leave this formatting to an editor when they merge
> Annex into the spec.
>
> >
> > p. 4 table 2, while the numerical values are the same, perhaps you
> > should break out byte zero as the AGN byte (i.e.
> > IBTA part of the SID space to be consistent with how Annex A3
> > structures the SID space - this is implied by p. 5, line 10
> > -- but the table maybe should reflect it as well
>
> OK.
> I will call byte 0 - IBTA AGN.
> Bytes 1-3 will retain the name - Prefix of RDMA-aware Service
> ID range.
>
> >
> > p. 7, Table 5 - Is there a reason why the suggested values start at
> > byte 41 (alignment?)? I would guess the largest suggested value would
> > be an IPv6 address, which won't take the whole 41-71 range anyway.
> >
> > Table 5 - Also should any remaining space in the ARI area, be
> > classified as "reserved"?
>
> OK.
> I had split it into bytes 41-56 for suggested value and 57-71
> for "remaining ARI field".
>
> Should we add that for ARI "0x00" - unspecified, there is no
> suggested resolution?
> I think not we can leave it to Provider to not specify
> anything and set byte 3 to 0x00 (no suggested value).
>
> >
> >
> > Thanks,
> > -ted
> >
> >
> > Kanevsky, Arkady wrote:
> > > Major changes are:
> > > definition and use of "downward compatible", ARI suggested value.
> > >
> > > Arkady Kanevsky                email: arkady at netapp.com
> > > Network Appliance Inc.         phone: 781-768-5395
> > > 1601 Trapelo Rd. - Suite 16.   Fax: 781-895-119
> > > Waltham, MA 02451              central phone: 781-768-5300
> >
> > --
> > Ted H. Kim
> > Sun Microsystems, Inc.                  ted.kim at sun.com
> > 222 North Sepulveda Blvd., 10th Floor   (310) 341-1116
> > El Segundo, CA 90245                    (310) 341-1120 FAX
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_address_annex_v5.pdf
Type: application/octet-stream
Size: 70284 bytes
Desc: ip_address_annex_v5.pdf
URL:

From halr at voltaire.com  Mon Feb 13 08:50:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 11:50:47 -0500
Subject: [openib-general] Re: [PATCH] Opensm - cl_event_wheel casting
In-Reply-To: <5zpsm027hf.fsf@mtl066.yok.mtl.com>
References: <5zpsm027hf.fsf@mtl066.yok.mtl.com>
Message-ID: <1139849446.4333.497.camel@hal.voltaire.com>

On Mon, 2006-02-06 at 03:53, Yael Kalka wrote:
> Hi Hal,
>
> The following patch adds the casting done in a clearer way - to avoid
> compilation errors in windows. Also - added a clear message if the
> timeout was trimmed (due to the casting).
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka

Thanks. Applied.

From halr at voltaire.com  Mon Feb 13 09:05:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 12:05:18 -0500
Subject: [openib-general] [PATCH] complib/cl_event_wheel.c: In cl_event_wheel_reg, use max 32 bit timeout rather than truncated timeout
Message-ID: <1139850317.4333.676.camel@hal.voltaire.com>

complib/cl_event_wheel.c: In cl_event_wheel_reg, use max 32 bit timeout
rather than truncated timeout when timeout overflows 32 bits

Signed-off-by: Hal Rosenstock

Index: complib/cl_event_wheel.c
===================================================================
--- complib/cl_event_wheel.c	(revision 5389)
+++ complib/cl_event_wheel.c	(working copy)
@@ -354,6 +354,7 @@ cl_event_wheel_reg(
 {
   cl_event_wheel_reg_info_t *p_event;
   uint64_t timeout;
+  uint32_t to;
   cl_status_t cl_status = CL_SUCCESS;
   cl_list_item_t *prev_event_list_item;
   cl_map_item_t *p_map_item;
@@ -428,16 +429,18 @@ cl_event_wheel_reg(

     /* The timeout for the cl_timer_start should be given as uint32_t.
        if there is an overflow - warn about it. */
+    to = (uint32_t)timeout;
     if ( timeout > (uint32_t)timeout )
     {
+      to = 0xffffffff; /* max 32 bit timer */
       osm_log (p_event_wheel->p_log, OSM_LOG_INFO,
                "cl_event_wheel_reg: "
-               "timeout requested is too large. Using timeout: %u \n",
-               (uint32_t)timeout );
+               "timeout requested is too large. Using timeout: %u\n",
+               to );
     }

     /* start the timer to the timeout [msec] */
-    cl_status = cl_timer_start(&p_event_wheel->timer, (uint32_t)timeout);
+    cl_status = cl_timer_start(&p_event_wheel->timer, to);
     if (cl_status != CL_SUCCESS)
     {
       osm_log (p_event_wheel->p_log, OSM_LOG_ERROR,

From halr at voltaire.com  Mon Feb 13 09:21:04 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Feb 2006 12:21:04 -0500
Subject: [openib-general] [PATCH] OpenSM include/opensm/osm_madw.h: Only include arbitary context when needed
Message-ID: <1139851263.4333.881.camel@hal.voltaire.com>

OpenSM include/opensm/osm_madw.h: Only include arbitrary context when
needed (e.g. not OpenIB)

Signed-off-by: Hal Rosenstock

Index: include/opensm/osm_madw.h
===================================================================
--- include/opensm/osm_madw.h	(revision 5387)
+++ include/opensm/osm_madw.h	(working copy)
@@ -315,6 +315,7 @@ typedef struct _osm_vla_context
   boolean_t set_method;
 } osm_vla_context_t;
 /*********/
+#ifndef OSM_VENDOR_INTF_OPENIB
 /****s* OpenSM: MAD Wrapper/osm_arbitrary_context_t
 * NAME
 *	osm_sa_context_t
@@ -330,7 +331,7 @@ typedef struct _osm_arbitrary_context
   void* context2;
 } osm_arbitrary_context_t;
 /*********/
-
+#endif
 /****s* OpenSM: MAD Wrapper/osm_madw_context_t
 * NAME
 *	osm_madw_context_t
@@ -351,7 +352,9 @@ typedef union _osm_madw_context
   osm_smi_context_t smi_context;
   osm_slvl_context_t slvl_context;
   osm_pkey_context_t pkey_context;
+#ifndef OSM_VENDOR_INTF_OPENIB
   osm_arbitrary_context_t arb_context;
+#endif
 } osm_madw_context_t;
 /*********/

@@ -880,6 +883,7 @@ osm_madw_get_slvl_context_ptr(
 *
 * SEE ALSO
 *********/
+
 /****f* OpenSM: MAD Wrapper/osm_madw_get_vla_context_ptr
 *
NAME * osm_madw_get_vla_context_ptr @@ -908,6 +912,7 @@ osm_madw_get_vla_context_ptr( * SEE ALSO *********/ +#ifndef OSM_VENDOR_INTF_OPENIB /****f* OpenSM: MAD Wrapper/osm_madw_get_arbitrary_context_ptr * NAME * osm_madw_get_arbitrary_context_ptr @@ -935,6 +940,7 @@ osm_madw_get_arbitrary_context_ptr( * * SEE ALSO *********/ +#endif /****f* OpenSM: MAD Wrapper/osm_madw_get_vend_ptr * NAME From suri at baymicrosystems.com Mon Feb 13 09:45:16 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Mon, 13 Feb 2006 12:45:16 -0500 Subject: [openib-general] RE: [RFC] [PATCH] mad.c: Add support for switch SMI In-Reply-To: <1139675973.4475.6062.camel@hal.voltaire.com> Message-ID: <200602131745.k1DHjMQg009583@mail.baymicrosystems.com> Hal: From a switch perspective, and from the little testing that I have done with different SM, I don't know how I can get out port num in the query AH. Because, from what I have observed, the out port num is different depending on whether the switch is in the middle or at the end of the query and hence one may have to compute this info from either return_path, or init_path or from the input port number(physical). Pardon my ignorance, but will I have access to the MAD packet in query_ah? Thanks a lot, Suri > > 2. On the send side, the driver must support the optional query_ah verb > in order to obtain the send side port number (actual switch external > port on which to send the DR SMP). 
> From halr at voltaire.com Mon Feb 13 10:16:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Feb 2006 13:16:38 -0500 Subject: [openib-general] RE: [RFC] [PATCH] mad.c: Add support for switch SMI In-Reply-To: <200602131745.k1DHjMQg009583@mail.baymicrosystems.com> References: <200602131745.k1DHjMQg009583@mail.baymicrosystems.com> Message-ID: <1139854597.4333.1579.camel@hal.voltaire.com> Suri, On Mon, 2006-02-13 at 12:45, Suresh Shelvapille wrote: > Hal: > > From a switch perspective, and from the little testing that I have done with > different SM, I don't think it's related to which SM is being used in the subnet. > I don't know how I can get out port num in the query AH. When create AH is called, it should be saved. It should be available from the AV for this AH. > Because, from what I have observed, the out port num is different depending > on whether the switch is in the middle or at the end of the query and hence > one may have to compute this info from either return_path, or init_path or > from the input port number(physical). You might be able to do this but it would be ugly. > Pardon my ignorance, but will I have access to the MAD packet in query_ah? I don't think this is available through create_ah or query_ah. -- Hal > Thanks a lot, > Suri > > > > > > > > 2. On the send side, the driver must support the optional query_ah verb > > in order to obtain the send side port number (actual switch external > > port on which to send the DR SMP). > > > From hugh at veritas.com Mon Feb 13 10:23:36 2006 From: hugh at veritas.com (Hugh Dickins) Date: Mon, 13 Feb 2006 18:23:36 +0000 (GMT) Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060213154114.GO32041@mellanox.co.il> References: <20060213154114.GO32041@mellanox.co.il> Message-ID: On Mon, 13 Feb 2006, Michael S. Tsirkin wrote: > OK, I guess it's time to start the push for merging this patch. 
> Probably not 2.6.16 material, but it would be nice to get this say into -mm to > make it easier to test this. Tested on x86_64 only. > > Please Cc me directly with comments, I'm not on the list. > > --- > > Add madvise options to control whether memory range is inherited across fork. > Useful e.g. for when hardware is doing DMA from/into these pages. > > Signed-off-by: Michael S. Tsirkin Looks good to me, Michael (but Gleb's eye has always proved better than mine). Just a couple of adjustments I'd ask before you send to Andrew: 1. Please just drop your mm/mmap.c vm_stat_account() mod: > if (flags & VM_HUGETLB) { > - if (!(flags & VM_DONTCOPY)) > + if (!(flags & (VM_DONTCOPY|VM_DONTFORK))) > mm->shared_vm += pages; Conscientious of you to include that, but (a) if it's right, then you'd need to be fiddling shared_vm up and down whenever madvise changes VM_DONTFORK, and none of us much want to get into that; and (b) I cannot for the life of me work out what that VM_HUGETLB VM_DONTCOPY block is about in the first place - I can't even find any instance of VM_HUGETLB with VM_DONTCOPY to judge it by. Luckily, wli CC'ed is the expert on both hugetlb and vm_stat_account - I hope he'll just tell us that block is wrong and should be deleted (which he or I could do as an unrelated patch). Perhaps it was an inappropriate hack to prevent some count going negative, from the days when we forgot to correct total_vm in the VM_DONTCOPY case in dup_mmap. 2. Your two-line changeset comment should be expanded: mention Infiniband, mention get_user_pages, explain how frustrating it is for the carefully pinned page to be orphaned from its user address space by a stray Copy-On-Write, if the process happens to fork meanwhile; and how VM_DONTFORK can be used to secure areas against that possibility. Mention how it could also be useful to an application, wanting to speed up its forks by cutting large areas out of consideration. 
Some of that information will be useful to Michael Kerrisk when he updates the madvise man page. Explain that MADV_DONTFORK should be reversible, hence MADV_DOFORK; but should not be reversible on areas a driver has so marked, hence VM_DONTFORK distinct from VM_DONTCOPY. Thanks, Hugh From mst at mellanox.co.il Mon Feb 13 11:02:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Feb 2006 21:02:06 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> Message-ID: <20060213190206.GC12458@mellanox.co.il> Quoting r. Hugh Dickins : > > Add madvise options to control whether memory range is inherited across fork. > > Useful e.g. for when hardware is doing DMA from/into these pages. > > > > Signed-off-by: Michael S. Tsirkin > > Looks good to me, Michael (but Gleb's eye has always proved better than > mine). Just a couple of adjustments I'd ask before you send to Andrew: Gleb has acked this to me in a private mail. Right, Gleb? ... > > 2. Your two-line changeset comment should be expanded: OK, thanks, I'll work on an appropriate description. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 13 11:04:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 11:04:02 -0800 Subject: [openib-general] madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060213154114.GO32041@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 13 Feb 2006 17:41:14 +0200") References: <20060213154114.GO32041@mellanox.co.il> Message-ID: One question, which I'm too lazy to read the source to answer: what does an old (pre-MADV_DONTFORK) kernel do with an application that tries to set MADV_DONTFORK? I'm wondering what portable applications will have to do to handle the case of a new application running on an old kernel. 
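One possible answer to the portability question, offered only as a sketch: madvise() reports advice values it does not recognize with EINVAL, so an application could probe for support at run time and fall back if the call is rejected. The helper name mark_dontfork(), its return convention, and the fallback numeric value for MADV_DONTFORK are assumptions made for illustration, not something taken from the patch under discussion.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_DONTFORK
/* Assumption: value used by Linux headers that define MADV_DONTFORK. */
#define MADV_DONTFORK 10
#endif

/*
 * Hypothetical helper: ask the kernel not to inherit [addr, addr + len)
 * across fork().
 *
 * Returns  1 if the advice was accepted,
 *          0 if the kernel rejected it with EINVAL (for a page-aligned
 *            range this most likely means the advice is unknown),
 *         -1 on any other error (errno is left set by madvise()).
 *
 * addr must be page-aligned; otherwise EINVAL is ambiguous.
 */
static int mark_dontfork(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_DONTFORK) == 0)
        return 1;
    if (errno == EINVAL)
        return 0;
    return -1;
}
```

When the helper returns 0 the caller would need some other strategy, for example not forking while buffers are registered for DMA; which fallback is appropriate is left open here.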
Thanks, Roland From torvalds at osdl.org Mon Feb 13 11:05:37 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Mon, 13 Feb 2006 11:05:37 -0800 (PST) Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060213154114.GO32041@mellanox.co.il> References: <20060213154114.GO32041@mellanox.co.il> Message-ID: On Mon, 13 Feb 2006, Michael S. Tsirkin wrote: > > Add madvise options to control whether memory range is inherited across fork. > Useful e.g. for when hardware is doing DMA from/into these pages. > > Signed-off-by: Michael S. Tsirkin > > - if (mpnt->vm_flags & VM_DONTCOPY) { > + if (mpnt->vm_flags & (VM_DONTCOPY | VM_DONTFORK)) { Why? That VM_DONTCOPY _is_ DONTFORK. Don't add a new useless DONTFORK that doesn't have any value. Linus From rdreier at cisco.com Mon Feb 13 11:13:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 11:13:21 -0800 Subject: [openib-general] [PATCH] mthca - command interface In-Reply-To: <1139833171.5814.81.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 13 Feb 2006 14:19:31 +0200") References: <1139833171.5814.81.camel@mtls03.yok.mtl.com> Message-ID: Thanks, looks good. A few comments: > + int can_post_doorbells; > + int dbell_post; I think these should be combined with struct mthca_cmd.use_events with some bit masks defined. It feels too bloaty to use 3 ints to hold what's really 3 bits. > + void __iomem *dbell_map; > + void __iomem *dbell_ptrs[8]; > + u64 dbell_base; > + u16 dbell_offsets[8]; The magic number 8 should be given a name as an enum value. It seems that we don't need to keep these values around for the whole lifetime of the driver. How about holding dbell_base and dbell_offsets somewhere temporary just long enough to create dbell_map and dbell_ptrs? Or maybe it's not even worth having dbell_ptrs -- just keep dbell_offsets and use it along with dbell_map when writing commands into the doorbell page. 
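A rough sketch of what these two suggestions could look like, written as plain user-space C so it can be compiled on its own. Every name here (the CMD_* constants, struct cmd_state, cmd_dbell_addr()) is hypothetical and not taken from the mthca sources, and the meaning given to each flag bit is an assumption:

```c
#include <stdint.h>

/* One flags word instead of three separate ints (hypothetical names). */
enum {
    CMD_USE_EVENTS     = 1 << 0, /* commands complete via events */
    CMD_POST_DOORBELLS = 1 << 1, /* doorbell posting currently enabled */
    CMD_CAN_POST_DBELL = 1 << 2  /* HW/FW reports doorbell posting support */
};

/* Named constant in place of the bare 8 used for the array sizes. */
enum {
    CMD_NUM_DBELL_DWORDS = 8
};

struct cmd_state {
    uint32_t flags;                               /* CMD_* bits */
    uint16_t dbell_offsets[CMD_NUM_DBELL_DWORDS]; /* offsets into the mapped page */
};

/* Test a capability bit instead of reading a dedicated int. */
static inline int cmd_can_post_doorbells(const struct cmd_state *cmd)
{
    return (cmd->flags & CMD_CAN_POST_DBELL) != 0;
}

/*
 * Compute the address of doorbell dword i from the mapping base plus the
 * stored offset, so no separate array of pointers has to be kept.
 */
static inline void *cmd_dbell_addr(void *dbell_map,
                                   const struct cmd_state *cmd, int i)
{
    return (char *)dbell_map + cmd->dbell_offsets[i];
}
```

In a driver the base would be an ioremapped `void __iomem *` and the stores would go through the kernel's I/O accessors; that part is deliberately left out of the sketch.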
> + MTHCA_GET(tmp, outbox, QUERY_FW_CMD_DB_EN_OFFSET); > + dev->cmd.can_post_doorbells = tmp & 0x1; Please add an mthca_dbg() line that says whether the HW/FW can post commands through doorbells, and whether enabling the feature succeeded. > +static void mthca_cmd_post_dbell(struct mthca_dev *dev, > + u64 in_param, > + u64 out_param, > + u32 in_modifier, > + u8 op_modifier, > + u16 op, > + u16 token) > +{ > + void __iomem **ptr = dev->cmd.dbell_ptrs; > + > + writel((__force u32) cpu_to_be32(in_param >> 32), ptr[0]); > + writel((__force u32) cpu_to_be32(in_param & 0xfffffffful), ptr[1]); Does this work on a big-endian system? It seems that writel() will do one swap too many; I think you need __raw_writel() as in the existing mthca_cmd_post(). I don't know what the precise ordering requirements for commands through the UAR are, but you need explicit wmb()s for __raw_writel() to guarantee order. - R. From mst at mellanox.co.il Mon Feb 13 11:20:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Feb 2006 21:20:46 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> Message-ID: <20060213192046.GD12458@mellanox.co.il> Quoting Linus Torvalds : > > > > Add madvise options to control whether memory range is inherited across fork. > > Useful e.g. for when hardware is doing DMA from/into these pages. > > > > Signed-off-by: Michael S. Tsirkin > > > > - if (mpnt->vm_flags & VM_DONTCOPY) { > > + if (mpnt->vm_flags & (VM_DONTCOPY | VM_DONTFORK)) { > > Why? > > That VM_DONTCOPY _is_ DONTFORK. > > Don't add a new useless DONTFORK that doesn't have any value. When this was last discussed, Hugh Dickins said: > If a driver sets VM_DONTCOPY, it's likely to be because the driver knows it'll > cause some nastiness (memory corruption, memory leak, lockup...) if it were > copied. The memory belongs to the driver, it's letting the process have a > window on it. 
I don't think we should now let the process overrule it. Here's a pointer to the relevant discussion: http://lkml.org/lkml/2005/11/3/112 -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 13 11:39:24 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Feb 2006 21:39:24 +0200 Subject: [openib-general] Re: [PATCH] mthca - command interface In-Reply-To: References: <1139833171.5814.81.camel@mtls03.yok.mtl.com> Message-ID: <20060213193924.GF12458@mellanox.co.il> Quoting Roland Dreier : > I don't know what the precise ordering requirements > for commands through the UAR are, but you need explicit wmb()s for > __raw_writel() to guarantee order. All writes must be done in-order, same as doorbell. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From iod00d at hp.com Mon Feb 13 12:10:37 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 13 Feb 2006 12:10:37 -0800 Subject: [openib-general] OpenIB Developers Workshop Presentation Available In-Reply-To: <20060213040706.0AA3A158003@hpcn.ca.sandia.gov> References: <20060213040706.0AA3A158003@hpcn.ca.sandia.gov> Message-ID: <20060213201037.GD5137@esmail.cup.hp.com> On Sun, Feb 12, 2006 at 08:07:06PM -0800, Matt Leininger wrote: > Most of the presentations from the OpenIB Developers Workshop are > available for download at > http://www.openib.org/conference/sonoma2006/index.html Matt - big thanks for posting those! Of the ones still missing, I'd really like to review "Oracle Need for RDS" Richard, any ETA when you can email it to Matt? (or post it here?) 
thanks, grant From rdreier at cisco.com Mon Feb 13 12:16:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 12:16:39 -0800 Subject: [openib-general] [PATCH] mthca: query_qp and query_srq In-Reply-To: <1139815657.5814.25.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 13 Feb 2006 09:27:37 +0200") References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Message-ID: Seems like this chunk: > --- openib_gen2.orig/drivers/infiniband/core/verbs.c > +++ openib_gen2/drivers/infiniband/core/verbs.c > @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, > int qp_attr_mask, > struct ib_qp_init_attr *qp_init_attr) > { > - return qp->device->query_qp ? > + int err; > + > + err = qp->device->query_qp ? > qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : > -ENOSYS; > + if (err) > + return err; > + qp_init_attr->recv_cq = qp->recv_cq; > + qp_init_attr->send_cq = qp->send_cq; > + qp_init_attr->srq = qp->srq; > + > + return err; > } > EXPORT_SYMBOL(ib_query_qp); really belonged in an earlier part of the series. Anyway I'll fix it up when applying the patches. Thanks, Roland From mst at mellanox.co.il Mon Feb 13 12:24:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Feb 2006 22:24:25 +0200 Subject: [openib-general] Fwd: Your message to openib-general awaits moderator approval Message-ID: <20060213202425.GH12458@mellanox.co.il> Why does "Too many recipients to the message" cause the message to be delayed? ----- Forwarded message from openib-general-bounces at openib.org ----- Subject: Your message to openib-general awaits moderator approval From: openib-general-bounces at openib.org Date: Mon, 13 Feb 2006 12:11:19 -0800 X-Spam: exempt Your mail to 'openib-general' with the subject Re: Re: madvise MADV_DONTFORK/MADV_DOFORK Is being held until the list moderator can review it for approval. 
The reason it is being held: Too many recipients to the message Either the message will get posted to the list, or you will receive notification of the moderator's decision. If you would like to cancel this posting, please visit the following URL: http://openib.org/mailman/confirm/openib-general/2348f5018b8ea93155d4396c27a9609d66e7548a ----- End forwarded message ----- -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 13 12:23:17 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 12:23:17 -0800 Subject: [openib-general] [git pull] InfiniBand fixes for 2.6.16-rc3 Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Michael S. Tsirkin: IPoIB: Don't start send-only joins while multicast thread is stopped IPoIB: Fix another send-only join race Ralph Campbell: IB/mad: Handle DR SMPs with a LID routed part Roland Dreier: IB/mthca: Don't print debugging info until we have all values IPoIB: Yet another fix for send-only joins IB/mthca: bump driver version and release date drivers/infiniband/core/mad.c | 10 ++++++ drivers/infiniband/hw/mthca/mthca_cmd.c | 38 ++++++++++++------------ drivers/infiniband/hw/mthca/mthca_dev.h | 4 +-- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 28 +++++++++++++++--- 5 files changed, 55 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index d393b50..c82f47a 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -665,7 +665,15 @@ static int handle_outgoing_dr_smp(struct struct ib_wc mad_wc; struct ib_send_wr *send_wr = &mad_send_wr->send_wr; - if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + /* + * Directed 
route handling starts if the initial LID routed part of + * a request or the ending LID routed part of a response is empty. + * If we are at the start of the LID routed part, don't update the + * hop_ptr or hop_cnt. See section 14.2.2, Vol 1 IB spec. + */ + if ((ib_get_smp_direction(smp) ? smp->dr_dlid : smp->dr_slid) == + IB_LID_PERMISSIVE && + !smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index f9b9b93..2825615 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1029,25 +1029,6 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); dev_lim->uar_scratch_entry_sz = size; - mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", - dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); - mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", - dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz); - mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", - dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); - mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", - dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); - mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", - dev_lim->reserved_mrws, dev_lim->reserved_mtts); - mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", - dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); - mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", - dev_lim->max_pds, dev_lim->reserved_mgms); - mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", - dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz); - - mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); - if (mthca_is_memfree(dev)) { MTHCA_GET(field, outbox, 
QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); dev_lim->max_srq_sz = 1 << field; @@ -1093,6 +1074,25 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE; } + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max SRQs: %d, reserved SRQs: %d, entry size: %d\n", + dev_lim->max_srqs, dev_lim->reserved_srqs, dev_lim->srq_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", + dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + out: mthca_free_mailbox(dev, mailbox); return err; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 2a165fd..e481037 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -53,8 +53,8 @@ #define DRV_NAME "ib_mthca" #define PFX DRV_NAME ": " -#define DRV_VERSION "0.06" -#define DRV_RELDATE "June 23, 2005" +#define DRV_VERSION "0.07" +#define DRV_RELDATE "February 13, 2006" enum { MTHCA_FLAG_DDR_HIDDEN = 1 << 1, diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e0a5412..2f85a9a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -78,6 +78,7 @@ 
enum { IPOIB_FLAG_SUBINTERFACE = 4, IPOIB_MCAST_RUN = 5, IPOIB_STOP_REAPER = 6, + IPOIB_MCAST_STARTED = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ccaa0c3..a2408d7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr } if (!priv->broadcast) { - priv->broadcast = ipoib_mcast_alloc(dev, 1); - if (!priv->broadcast) { + struct ipoib_mcast *broadcast; + + broadcast = ipoib_mcast_alloc(dev, 1); + if (!broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) @@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr return; } - memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + spin_lock_irq(&priv->lock); + memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); + priv->broadcast = broadcast; - spin_lock_irq(&priv->lock); __ipoib_mcast_add(dev, priv->broadcast); spin_unlock_irq(&priv->lock); } @@ -601,6 +604,10 @@ int ipoib_mcast_start_thread(struct net_ queue_work(ipoib_workqueue, &priv->mcast_task); mutex_unlock(&mcast_mutex); + spin_lock_irq(&priv->lock); + set_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + return 0; } @@ -611,6 +618,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + spin_lock_irq(&priv->lock); + clear_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + mutex_lock(&mcast_mutex); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); @@ -693,6 +704,14 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || + !priv->broadcast || + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + 
++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + goto unlock; + } + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -754,6 +773,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } From suri at baymicrosystems.com Mon Feb 13 12:45:55 2006 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Mon, 13 Feb 2006 15:45:55 -0500 Subject: [openib-general] RE: [RFC] [PATCH] mad.c: Add support for switch SMI In-Reply-To: <1139854597.4333.1579.camel@hal.voltaire.com> Message-ID: <200602132046.k1DKk0Fb013264@mail.baymicrosystems.com> Sorry Hal: But I don't understand... > > When create AH is called, it should be saved. It should be available > from the AV for this AH. > are you saying I should save the port number in ah_attr.port_num which is passed as a parameter to the create_ah(pd, ah_attr) method? If so, ah_attr.port_num is set to mad_agent->port_num which is going to be zero on a switch! Otherwise, if you are saying I should determine this in create_ah() and save it then I would need access to init_path, return_path etc....right?? Thanks a lot, Suri From halr at voltaire.com Mon Feb 13 14:03:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Feb 2006 17:03:17 -0500 Subject: [openib-general] RE: [RFC] [PATCH] mad.c: Add support for switch SMI In-Reply-To: <200602132046.k1DKk0Fb013264@mail.baymicrosystems.com> References: <200602132046.k1DKk0Fb013264@mail.baymicrosystems.com> Message-ID: <1139868140.4333.4189.camel@hal.voltaire.com> On Mon, 2006-02-13 at 15:45, Suresh Shelvapille wrote: > Sorry Hal: > > But I don't understand... > > > > > When create AH is called, it should be saved. It should be available > > from the AV for this AH. > > > are you saying I should save the port number in ah_attr.port_num which is > passed as a parameter to the create_ah(pd, ah_attr) method? Yes. 
> If so, ah_attr.port_num is set to mad_agent->port_num which is going to be > zero on a switch! Why ? The port_num in the ah_attr passed to create_ah needs to be set to the switch external port number on the send side for switches: struct ib_ah_attr { struct ib_global_route grh; u16 dlid; u8 sl; u8 src_path_bits; u8 static_rate; u8 ah_flags; u8 port_num; }; It would be filled in from the send_wr which is posted (see ib_post_send_mad). struct ib_send_wr { struct ib_send_wr *next; u64 wr_id; struct ib_sge *sg_list; int num_sge; enum ib_wr_opcode opcode; int send_flags; __be32 imm_data; union { ... struct { struct ib_ah *ah; u32 remote_qpn; u32 remote_qkey; u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; } wr; }; On the receive side, it comes from the WC: struct ib_wc { u64 wr_id; enum ib_wc_status status; enum ib_wc_opcode opcode; u32 vendor_err; u32 byte_len; __be32 imm_data; u32 qp_num; u32 src_qp; int wc_flags; u16 pkey_index; u16 slid; u8 sl; u8 dlid_path_bits; u8 port_num; /* valid only for DR SMPs on switches */ }; If you are seeing this called with port_num 0, I missed another place where this should be set. Is that what you are saying ? > Otherwise, if you are saying I should determine this in create_ah() and save > it then I would need access to init_path, return_path etc....right?? That's not what I'm saying. -- Hal > Thanks a lot, > Suri > From halr at voltaire.com Mon Feb 13 14:42:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Feb 2006 17:42:38 -0500 Subject: [openib-general] IPoIB and lid change In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B754@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B754@mtlexch01.mtl.com> Message-ID: <1139870556.4333.4637.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-02-13 at 10:23, Eitan Zahavi wrote: > Hi, > > I had a long discussion today with Michael, Yael and Tziporet regarding > this issue. 
> We have got to the following conclusions/proposal: > > 1. As we use only GID[0] (that can not change) and a QP that is reserved > for the interface even if it is down we actually "never" change IPoIB > MAC (unless you download the module). remove the module > So unsolicited ARP reply is not a must. The MAC (which is GID,QP in IPoIB) > is kept even if the LID changes. > > 2. When a new SM is brought up it optionally can send > ClientReRegistration which should cause the entire UD AV cache (and the > new SA Path Cache) to be flushed. Would this be an SM requirement ? The entire UD AV cache includes both unicast and multicast, right ? > 3. When subnets are merged there is currently no way to tell if there > were any LID changes. The SM does not need to do any "Set(PortInfo)" on > a node if no change is required so the remote node (that did not change > LID) will not know about any such change. There are several solutions > for this problem. The one we believe should be promoted is the concept > of "UnPath" Notice: UnPath will be needed for QoS too. It has been discussed in that context as well. > a. In IB any class manager (SM) can generate Traps carrying a Notice > attribute. We > propose a new Notice of trap number = YYY which will mean UnPath > message. > The notice will carry several fields of the path record as its > DataDetails field and also > a component mask. (the DataDetails field is 432 bits long). > So the following fields are to be included in the DataDetails: > SLID, DLID, P_KEY, TCLASS, SL, compMask. > The component mask will allow to wildcard the fields values such > that: > * an UnPath Notice including a component mask = 0 will UnPath all > paths. > * an UnPath Notice including a component mask = 1 will UnPath all > paths from the > LID = the SLID included in the Notice. > > b. The IPoIB will have to register with the SA to receive these notices. > > c. 
The SM needs to be smart with the way the notices are being built: > The SM should be able to coalesce multiple change events to one > notice by using the > wildcard (zero compmask) as appropriate for the case inspected by > the SM. > > With this kind of coalescing the SM does not need to generate O(N^2) > Reports but only O(N) Reports. As you might guess any discovery and > setting of the fabric require O(N^2/32) just for LFTs setting Or perhaps these are distributed on some MC group reserved for this (unpath) ? > s so our > problem of distributing these N reports is not so big. > > My plan is to bring this UnPath Notice to the IBTA MgtWG for > discussion. Sounds good; (That's where this will need to be standardized.) -- Hal > Thanks > > Eitan > From mlleinin at hpcn.ca.sandia.gov Mon Feb 13 14:58:20 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 13 Feb 2006 14:58:20 -0800 Subject: [openib-general] OpenIB Developers Workshop Presentation Available In-Reply-To: <20060213201037.GD5137@esmail.cup.hp.com> References: <20060213040706.0AA3A158003@hpcn.ca.sandia.gov> <20060213201037.GD5137@esmail.cup.hp.com> Message-ID: <1139871500.6044.1534.camel@localhost> On Mon, 2006-02-13 at 12:10 -0800, Grant Grundler wrote: > On Sun, Feb 12, 2006 at 08:07:06PM -0800, Matt Leininger wrote: > > Most of the presentations from the OpenIB Developers Workshop are > > available for download at > > http://www.openib.org/conference/sonoma2006/index.html > > Matt - big thanks for posting those! > > Of the ones still missing, I'd really like to review > "Oracle Need for RDS" > > Richard, any ETA when you can email it to Matt? > (or post it here?) > Thanks to Richard for sending them to me. They are posted now. 
- Matt From rdreier at cisco.com Mon Feb 13 14:59:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 14:59:59 -0800 Subject: [openib-general] [PATCH] mthca: query_qp and query_srq In-Reply-To: <1139815657.5814.25.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 13 Feb 2006 09:27:37 +0200") References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Message-ID: > --- openib_gen2.orig/drivers/infiniband/core/verbs.c > +++ openib_gen2/drivers/infiniband/core/verbs.c > @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, > int qp_attr_mask, > struct ib_qp_init_attr *qp_init_attr) > { > - return qp->device->query_qp ? > + int err; > + > + err = qp->device->query_qp ? > qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : > -ENOSYS; > + if (err) > + return err; > + qp_init_attr->recv_cq = qp->recv_cq; > + qp_init_attr->send_cq = qp->send_cq; > + qp_init_attr->srq = qp->srq; > + > + return err; > } > EXPORT_SYMBOL(ib_query_qp); Actually this chunk is pretty silly -- anyone querying the QP could just look into the qp struct and get the information directly. So I'm going to drop this chunk unless there's something I'm missing. - R. From mst at mellanox.co.il Mon Feb 13 15:30:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 01:30:51 +0200 Subject: [openib-general] Re: [PATCH] mthca: query_qp and query_srq In-Reply-To: References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Message-ID: <20060213233051.GA14766@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: [PATCH] mthca: query_qp and query_srq > > > --- openib_gen2.orig/drivers/infiniband/core/verbs.c > > +++ openib_gen2/drivers/infiniband/core/verbs.c > > @@ -257,9 +257,18 @@ int ib_query_qp(struct ib_qp *qp, > > int qp_attr_mask, > > struct ib_qp_init_attr *qp_init_attr) > > { > > - return qp->device->query_qp ? > > + int err; > > + > > + err = qp->device->query_qp ? 
> > qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : > > -ENOSYS; > > + if (err) > > + return err; > > + qp_init_attr->recv_cq = qp->recv_cq; > > + qp_init_attr->send_cq = qp->send_cq; > > + qp_init_attr->srq = qp->srq; > > + > > + return err; > > } > > EXPORT_SYMBOL(ib_query_qp); > > Actually this chunk is pretty silly -- anyone querying the QP could > just look into the qp struct and get the information directly. So I'm > going to drop this chunk unless there's something I'm missing. > > - R. Still, it seems a bit ugly to leave some fields in ib_qp_init_attr uninitialized. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 13 17:12:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 17:12:08 -0800 Subject: [openib-general] query_qp and query_srq (and libibverbs 1.0) In-Reply-To: <1139815657.5814.25.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 13 Feb 2006 09:27:37 +0200") References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Message-ID: I've now finished merging all of the query QP/query SRQ patches. Please let me know if I messed something up while merging. This means that I now consider the libibverbs API and ABI frozen for the 1.0 branch, and very soon I will do a 1.0-rc6 release. A full 1.0 release should follow soon afterwards if no major problems are encountered. If there's something important to you for libibverbs 1.0, speak up soon or wait for 1.0.1 ;) (Just to be clear, API/ABI breaking changes will have to wait for 1.1.0) - R. From mst at mellanox.co.il Mon Feb 13 17:27:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 03:27:21 +0200 Subject: [openib-general] Re: query_qp and query_srq (and libibverbs 1.0) In-Reply-To: References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> Message-ID: <20060214012721.GB15063@mellanox.co.il> Quoting r. 
Roland Dreier : > If there's something important to you for libibverbs 1.0, speak up > soon or wait for 1.0.1 ;) (Just to be clear, API/ABI breaking changes > will have to wait for 1.1.0) Please take a look at devinfo_board_id.patch in contrib/mellanox/patches. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 13 21:37:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 21:37:32 -0800 Subject: [openib-general] Re: query_qp and query_srq (and libibverbs 1.0) In-Reply-To: <20060214012721.GB15063@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 14 Feb 2006 03:27:21 +0200") References: <1139815657.5814.25.camel@mtls03.yok.mtl.com> <20060214012721.GB15063@mellanox.co.il> Message-ID: Michael> Please take a look at devinfo_board_id.patch in Michael> contrib/mellanox/patches. Thanks, I applied it. There was some reason I didn't like it when you first posted it, but enough time has passed that I've forgotten why. - R. From eitan at mellanox.co.il Mon Feb 13 22:01:39 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 14 Feb 2006 08:01:39 +0200 Subject: [openib-general] IPoIB and lid change In-Reply-To: <1139870556.4333.4637.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B754@mtlexch01.mtl.com> <1139870556.4333.4637.camel@hal.voltaire.com> Message-ID: <43F17243.2010807@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Mon, 2006-02-13 at 10:23, Eitan Zahavi wrote: > >>Hi, >> >>I had a long discussion today with Michael, Yael and Tziporet regarding >>this issue. >>We have got to the following conclusions/proposal: >> >>1. As we use only GID[0] (that can not change) and a QP that is reserved >>for the interface even if it is down we actually "never" change IPoIB >>MAC (unless you download the module). > > > remove the module Sure - pardon my English. > > >>So unsolicited ARP reply is not a must. The MAC (which is GID,QP in IPoIB) >>is kept even if the LID changes. 
>> >>2. When a new SM is brought up it optionally can send >>ClientReRegistration which should cause the entire UD AV cache (and the >>new SA Path Cache) to be flushed. > > > Would this be an SM requirement ? This is an optional SM feature. So I can not "require" it. Implementation of the "UnPath" event would suffice but will probably be "optional" too. The entire UD AV cache includes both > unicast and multicast, right ? Yes both require to be flushed. > > >>3. When subnets are merged there is currently no way to tell if there >>were any LID changes. The SM does not need to do any "Set(PortInfo)" on >>a node if no change is required so the remote node (that did not change >>LID) will not know about any such change. There are several solutions >>for this problem. The one we believe should be promoted is the concept >>of "UnPath" Notice: > > > UnPath will be needed for QoS too. It has been discussed in that context > as well. Agree. > > >>a. In IB any class manager (SM) can generate Traps carrying a Notice >>attribute. We >> propose a new Notice of trap number = YYY which will mean UnPath >>message. >> The notice will carry several fields of the path record as its >>DataDetails field and also >> a component mask. (the DataDetails field is 432 bits long). >> So the following fields are to be included in the DataDetails: >> SLID, DLID, P_KEY, TCLASS, SL, compMask. >> The component mask will allow to wildcard the fields values such >>that: >> * an UnPath Notice including a component mask = 0 will UnPath all >>paths. >> * an UnPath Notice including a component mask = 1 will UnPath all >>paths from the >> LID = the SLID included in the Notice. >> >>b. The IPoIB will have to register with the SA to receive these notices. >> >>c. The SM needs to be smart with the way the notices are being built: >> The SM should be able to coalesce multiple change events to one >>notice by using the >> wildcard (zero compmask) as appropriate for the case inspected by >>the SM. 
>>
>>With this kind of coalescing the SM does not need to generate O(N^2)
>>Reports but only O(N) Reports. As you might guess any discovery and
>>setting of the fabric require O(N^2/32) just for LFTs setting
>
>
> Or perhaps these are distributed on some MC group reserved for this
> (unpath) ?

I would prefer to pay the price of sending O(N) packets and "know" they
arrive. Any node not responding to the Report with ReportResp will get some
"retries" and if even that does not help we at least know about it.
Using MC will not provide this knowledge - even though we could use a scheme
where we "retry" each send multiple times.
I consider this a secondary issue. I am glad we agree on the overall concept.

>
>
>>s so our
>>problem of distributing these N reports is not so big.
>>
>> My plan is to bring this UnPath Notice to the IBTA MgtWG for
>>discussion.
>
>
> Sounds good; (That's where this will need to be standardized.)

Just wanted to see if this will make sense on the OpenIB list first.

>
> -- Hal
>
>
>>Thanks
>>
>>Eitan
>>
>
>

From rdreier at cisco.com Mon Feb 13 22:09:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Feb 2006 22:09:28 -0800
Subject: [openib-general] [PATCH] mthca - command interface
In-Reply-To: <1139833171.5814.81.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 13 Feb 2006 14:19:31 +0200")
References: <1139833171.5814.81.camel@mtls03.yok.mtl.com>
Message-ID:

Hmm, I think I see one more issue. You take the FW's word for the
doorbell base address here:

 > + MTHCA_GET(dev->cmd.dbell_base, outbox, QUERY_FW_CMD_DB_BASE);

and then just ioremap that with no adjustment:

 > + map_base = ioremap(dev->cmd.dbell_base, max_off +
 > + sizeof(unsigned long));

This will be broken on architectures such as ppc64 where the HCA's
view of bus address doesn't match up with what the kernel expects to
be passed to ioremap.
I think you need to do something like the code in
mthca_start_catas_poll(), which takes the address from the firmware
and uses it as an offset into a PCI resource to make this work:

        addr = pci_resource_start(dev->pdev, 0) +
                ((pci_resource_len(dev->pdev, 0) - 1) &
                 dev->catas_err.addr);

        if (!request_mem_region(addr, dev->catas_err.size * 4,
                                DRV_NAME)) {
                mthca_warn(dev, "couldn't request catastrophic error region "
                           "at 0x%lx/0x%x\n", addr, dev->catas_err.size * 4);
                return;
        }

        dev->catas_err.map = ioremap(addr, dev->catas_err.size * 4);
        if (!dev->catas_err.map) {
                mthca_warn(dev, "couldn't map catastrophic error region "
                           "at 0x%lx/0x%x\n", addr, dev->catas_err.size * 4);
                release_mem_region(addr, dev->catas_err.size * 4);
                return;
        }

(And looking at the catastrophic error code, I notice that you didn't
request the doorbell region before ioremapping it, which is another
issue to fix)

 - R.

From rdreier at cisco.com Mon Feb 13 22:12:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Feb 2006 22:12:01 -0800
Subject: [openib-general] [PATCH] mthca - command interface
In-Reply-To: (Roland Dreier's message of "Mon, 13 Feb 2006 22:09:28 -0800")
References: <1139833171.5814.81.camel@mtls03.yok.mtl.com>
Message-ID:

And yet one more thing, looking at that last email:

 > + map_base = ioremap(dev->cmd.dbell_base, max_off +
 > + sizeof(unsigned long));

why is this sizeof (unsigned long)? I think you just want sizeof
(u32), since the size of the command register doesn't change depending
on the kernel's word size.

 - R.

From rdreier at cisco.com Mon Feb 13 11:16:32 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Feb 2006 11:16:32 -0800
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: (Linus Torvalds's message of "Mon, 13 Feb 2006 11:05:37 -0800 (PST)")
References: <20060213154114.GO32041@mellanox.co.il>
Message-ID:

 Linus> Why?

 Linus> That VM_DONTCOPY _is_ DONTFORK.
 Linus> Don't add a new useless DONTFORK that doesn't have any
 Linus> value.

VM_DONTCOPY is hardly used in the kernel, so the semantics aren't very
precisely defined. But the idea is that a driver setting VM_DONTCOPY
probably has a good reason for doing it, and we don't want userspace
to be able to erase that flag through madvise().

As Hugh said in his suggestion for a better changelog entry:

 > Explain that MADV_DONTFORK should be reversible, hence
 > MADV_DOFORK; but should not be reversible on areas a driver has
 > so marked, hence VM_DONTFORK distinct from VM_DONTCOPY.

Perhaps we don't care for now, and we should wait and add
VM_KERNEL_DONTCOPY later if we really need it. I honestly don't know.

 - Roland

From torvalds at osdl.org Mon Feb 13 11:34:43 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Mon, 13 Feb 2006 11:34:43 -0800 (PST)
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il>
Message-ID:

On Mon, 13 Feb 2006, Roland Dreier wrote:
>
> VM_DONTCOPY is hardly used in the kernel, so the semantics aren't very
> precisely defined.

Now, I agree - it's a strange bit, and was initially just done "because we
can and it seems to be a conceptually valid notion", so it's not used a
lot.

That said, the semantics shouldn't be all that unexpected:

        #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */

and the usage ends up matching that (except for some really strange issue
with hugepage counting, which just looks wrong, but never mind).

> But the idea is that a driver setting VM_DONTCOPY
> probably has a good reason for doing it, and we don't want userspace
> to be able to erase that flag through madvise().

Well, I can't actually see any case where a driver could validly do
something that confuses the VM enough that clearing that bit could cause
new problems.

Put another way: if that is true, then we have bigger issues, and should
fix those problems instead.
So at most we might have _applications_ that depend on the fork not
causing a copy-on-write thing (due to the old and broken private mapping
of ioremapped areas behaviour), but if that's true, then it would have to
be the driver itself that does the MADV_DOFORK thing, so..

> As Hugh said in his suggestion for a better changelog entry:
>
> > Explain that MADV_DONTFORK should be reversible, hence
> > MADV_DOFORK; but should not be reversible on areas a driver has
> > so marked, hence VM_DONTFORK distinct from VM_DONTCOPY.
>
> Perhaps we don't care for now, and we should wait and add
> VM_KERNEL_DONTCOPY later if we really need it. I honestly don't know.

I can see where Hugh is coming from, but I think it's adding cruft very
much for a "be very careful" reason.

I would suggest that if you wanted to be very careful, you'd simply
disallow changing - or perhaps just clearing - that DONTCOPY flag on
special regions (ie ones that have been marked with VM_IO or VM_RESERVED).

                Linus

From hugh at veritas.com Mon Feb 13 11:50:41 2006
From: hugh at veritas.com (Hugh Dickins)
Date: Mon, 13 Feb 2006 19:50:41 +0000 (GMT)
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il>
Message-ID:

On Mon, 13 Feb 2006, Linus Torvalds wrote:
>
> and the usage ends up matching that (except for some really strange issue
> with hugepage counting, which just looks wrong, but never mind).

Yes, I cc'ed wli on my reply to Michael, we're hoping he'll just
say delete that block.

> I can see where Hugh is coming from, but I think it's adding cruft very
> much for a "be very careful" reason.
>
> I would suggest that if you wanted to be very careful, you'd simply
> disallow changing - or perhaps just clearing - that DONTCOPY flag on
> special regions (ie ones that have been marked with VM_IO or VM_RESERVED).
Fair enough, disallow clearing on VM_IO (VM_RESERVED is on its way out,
does little more than perpetuate a few accounting anomalies I think).
So no new VM_DONTFORK flag, stick with VM_DONTCOPY: that's fine with me.

Hugh

From hugh at veritas.com Mon Feb 13 13:57:31 2006
From: hugh at veritas.com (Hugh Dickins)
Date: Mon, 13 Feb 2006 21:57:31 +0000 (GMT)
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060213210906.GC13603@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il>
Message-ID:

On Mon, 13 Feb 2006, Michael S. Tsirkin wrote:
>
> Like this then?

Almost. I would still prefer madvise_vma to allow MADV_DONTFORK
on a VM_IO vma, even though it must prohibit MADV_DOFORK there.
But if Linus disagrees, of course ignore me.

Comments much better, thanks. I didn't get your point about mlock'd
memory, but I'm content to believe you're thinking of an issue that
hasn't occurred to me.

Hugh

From torvalds at osdl.org Mon Feb 13 14:27:54 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Mon, 13 Feb 2006 14:27:54 -0800 (PST)
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il>
Message-ID:

On Mon, 13 Feb 2006, Hugh Dickins wrote:
>
> Almost. I would still prefer madvise_vma to allow MADV_DONTFORK
> on a VM_IO vma, even though it must prohibit MADV_DOFORK there.
> But if Linus disagrees, of course ignore me.

No, I agree. Quite frankly, I'd be willing to allow even the other way
around, because I don't see how the VM could screw up, but prohibiting
DOFORK is clearly the safer thing to do.
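
The rule settled on in this exchange (MADV_DONTFORK accepted on any mapping, MADV_DOFORK refused on a VM_IO mapping) can be sketched as a small userspace model in C. It is an illustration only and is not the code from the posted patch, which rejects both requests on VM_IO. All sketch_* and SKETCH_* names are hypothetical; the VM_DONTCOPY value is the one quoted in the thread, the advice numbers are the ones used in the posted patch, and the VM_IO value is an assumption taken from 2.6-era kernel headers.

```c
#include <errno.h>

/* Userspace model only; none of these names exist in the kernel. */
#define SKETCH_VM_IO       0x00004000UL /* assumed value, see note above */
#define SKETCH_VM_DONTCOPY 0x00020000UL /* value quoted in the thread */

enum sketch_advice {
        SKETCH_MADV_DONTFORK = 0x30, /* numbers used by the posted patch */
        SKETCH_MADV_DOFORK   = 0x31
};

/*
 * Apply fork-inheritance advice to a copy of a mapping's flags.
 * Returns 0 and stores the new flags in *out, or -EINVAL when the
 * request is refused, in which case *out is left untouched.
 */
int sketch_fork_advice(unsigned long vm_flags, int advice,
                       unsigned long *out)
{
        switch (advice) {
        case SKETCH_MADV_DONTFORK:
                /* Always allowed, including on VM_IO mappings. */
                *out = vm_flags | SKETCH_VM_DONTCOPY;
                return 0;
        case SKETCH_MADV_DOFORK:
                /* A driver may have set VM_DONTCOPY on its I/O mapping,
                 * so userspace is not allowed to clear it there. */
                if (vm_flags & SKETCH_VM_IO)
                        return -EINVAL;
                *out = vm_flags & ~SKETCH_VM_DONTCOPY;
                return 0;
        default:
                return -EINVAL;
        }
}
```

With this model, setting and then clearing the flag on an ordinary mapping round-trips to the original flags, while clearing it on a mapping that carries the VM_IO bit fails with -EINVAL and leaves the flags as they were.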
                Linus

From hugh at veritas.com Mon Feb 13 14:54:57 2006
From: hugh at veritas.com (Hugh Dickins)
Date: Mon, 13 Feb 2006 22:54:57 +0000 (GMT)
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060213220947.GD13603@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213220947.GD13603@mellanox.co.il>
Message-ID:

On Tue, 14 Feb 2006, Michael S. Tsirkin wrote:
> Quoting r. Hugh Dickins :
>
> > Comments much better, thanks. I didn't get your point about mlock'd
> > memory, but I'm content to believe you're thinking of an issue that
> > hasn't occurred to me.
>
> I'm referring to the following, from man mlock(2):
>
> "Cryptographic security software often handles critical bytes like passwords
> or secret keys as data structures. As a result of paging, these secrets could
> be transferred onto a persistent swap store medium, where they might be
> accessible to the enemy long after the security software has erased the
> secrets in RAM and terminated."

Ah, I get it, thanks: once parent and child have distinct pages,
the child's is not locked in memory and might go out to swap.
Yes, a valid point, and a relevant use for MADV_DONTFORK.

Hugh

From mst at mellanox.co.il Mon Feb 13 12:05:46 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 13 Feb 2006 22:05:46 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il>
Message-ID: <20060213200546.GG12458@mellanox.co.il>

Quoting Linus Torvalds :
> I would suggest that if you wanted to be very careful, you'd simply
> disallow changing - or perhaps just clearing - that DONTCOPY flag on
> special regions (ie ones that have been marked with VM_IO or VM_RESERVED).
Right, this was already proposed here
http://lkml.org/lkml/2005/11/3/81

and I cite:
> You're then saying that a process cannot set VM_DONTCOPY on a VM_IO
> area to prevent the first child getting the area, but clear it after
> so the next child does get a copy of the area. I think it'd be wrong
> (surprising) to limit the functionality in that way.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Mon Feb 13 13:09:06 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 13 Feb 2006 23:09:06 +0200
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il>
Message-ID: <20060213210906.GC13603@mellanox.co.il>

Quoting Hugh Dickins :
> Fair enough, disallow clearing on VM_IO (VM_RESERVED is on its way out,
> does little more than perpetuate a few accounting anomalies I think).
> So no new VM_DONTFORK flag, stick with VM_DONTCOPY: that's fine with me.

Like this then?

---

Currently, copy-on-write may change the physical address of a page even if
the user requested that the page is pinned in memory (either by mlock or by
get_user_pages). This happens if the process forks meanwhile, and the parent
writes to that page. As a result, the page is orphaned: in case of
get_user_pages, the application will never see any data hardware DMAs into
this page after the COW. In case of mlock'd memory, the parent is not getting
the realtime/security benefits of mlock.

In particular, this affects the InfiniBand modules which do DMA from and into
user pages all the time.

Add madvise options to control whether memory range is inherited across fork.
Useful e.g. for when hardware is doing DMA from/into these pages. Could also
be useful to an application wanting to speed up its forks by cutting large
areas out of consideration.

Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.16-rc2/mm/madvise.c =================================================================== --- linux-2.6.16-rc2.orig/mm/madvise.c 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/mm/madvise.c 2006-02-14 01:34:20.000000000 +0200 @@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a struct mm_struct * mm = vma->vm_mm; int error = 0; pgoff_t pgoff; - int new_flags = vma->vm_flags & ~VM_READHINTMASK; + int new_flags = vma->vm_flags; switch (behavior) { + case MADV_NORMAL: + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; + break; case MADV_SEQUENTIAL: - new_flags |= VM_SEQ_READ; + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: - new_flags |= VM_RAND_READ; + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; - default: + case MADV_DONTFORK: + new_flags |= VM_DONTCOPY; + break; + case MADV_DOFORK: + new_flags &= ~VM_DONTCOPY; break; } @@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, long error; switch (behavior) { + case MADV_DONTFORK: + case MADV_DOFORK: + if (vma->vm_flags & VM_IO) { + error = -EINVAL; + break; + } case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -37,6 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h 
2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-powerpc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -45,6 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-cris/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-cris/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm26/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-alpha/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-alpha/mman.h 2006-02-14 
01:24:57.000000000 +0200 @@ -43,6 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m68k/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-xtensa/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -73,6 +73,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-mips/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-mips/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-mips/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -66,6 +66,8 @@ #define MADV_WILLNEED 0x3 /* 
pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-v850/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-v850/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -33,6 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-s390/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-s390/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define 
MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-parisc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -49,6 +49,8 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-i386/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-i386/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sh/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sh/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit 
across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-ia64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m32r/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* 
compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-frv/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-frv/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-h8300/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Mon Feb 13 14:09:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 14 Feb 2006 00:09:47 +0200
Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il>
Message-ID: <20060213220947.GD13603@mellanox.co.il>

Quoting r. Hugh Dickins :
> Subject: Re: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK
>
> On Mon, 13 Feb 2006, Michael S. Tsirkin wrote:
> >
> > Like this then?
>
> Almost. I would still prefer madvise_vma to allow MADV_DONTFORK
> on a VM_IO vma, even though it must prohibit MADV_DOFORK there.
> But if Linus disagrees, of course ignore me.

I'm not sure about this point. Linus?

> Comments much better, thanks. I didn't get your point about mlock'd
> memory, but I'm content to believe you're thinking of an issue that
> hasn't occurred to me.

I'm referring to the following, from man mlock(2):

"Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transfered onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated."

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Mon Feb 13 14:55:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 14 Feb 2006 00:55:38 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il>
Message-ID: <20060213225538.GE13603@mellanox.co.il>

Here's the final version then.

Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages).
This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMAs into this page after the COW. In particular, this affects the Infiniband modules which do DMA from and into user pages all the time. This patch adds madvise options to control whether memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages. Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.16-rc2/mm/madvise.c =================================================================== --- linux-2.6.16-rc2.orig/mm/madvise.c 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/mm/madvise.c 2006-02-14 03:20:33.000000000 +0200 @@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a struct mm_struct * mm = vma->vm_mm; int error = 0; pgoff_t pgoff; - int new_flags = vma->vm_flags & ~VM_READHINTMASK; + int new_flags = vma->vm_flags; switch (behavior) { + case MADV_NORMAL: + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; + break; case MADV_SEQUENTIAL: - new_flags |= VM_SEQ_READ; + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: - new_flags |= VM_RAND_READ; + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; - default: + case MADV_DONTFORK: + new_flags |= VM_DONTCOPY; + break; + case MADV_DOFORK: + new_flags &= ~VM_DONTCOPY; break; } @@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, long error; switch (behavior) { + case MADV_DOFORK: + if (vma->vm_flags & VM_IO) { + error = -EINVAL; + break; + } + case MADV_DONTFORK: case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 
01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -37,6 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-powerpc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -45,6 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-cris/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-cris/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm26/mman.h 2006-02-14 
01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-alpha/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-alpha/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -43,6 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m68k/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-xtensa/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -73,6 +73,8 @@ #define MADV_WILLNEED 0x3 /* 
pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-mips/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-mips/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-mips/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -66,6 +66,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-v850/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-v850/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -33,6 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define 
MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-s390/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-s390/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-parisc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -49,6 +49,8 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-i386/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-i386/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont 
inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sh/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sh/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-ia64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* 
compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m32r/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-frv/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-frv/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-h8300/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm/mman.h 
=================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 13 15:01:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 01:01:00 +0200 Subject: [openib-general] [PATCH] madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213220947.GD13603@mellanox.co.il> Message-ID: <20060213230100.GF13603@mellanox.co.il> Here's the final version of MADV_DONTFORK/MADV_DOFORK patch. Hugh, I gather you'll forward this to Andrew, correct? --- Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMAs into this page after the COW. In case of mlock'd memory, the parent is not getting the real-time/security benefits of mlock. In particular, this affects the Infiniband modules which do DMA from and into user pages all the time. This patch adds madvise options to control whether memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages. 
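[Editorial sketch, not part of the original message or patch.] From user space, the option described above would be used roughly as follows. The helper name alloc_dontfork_buffer is hypothetical, and the example assumes a kernel and C library headers that define MADV_DONTFORK. The flag value is taken from <sys/mman.h> rather than hard-coded, since the number on a given system may differ from the 0x30 proposed in this patch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): map `len` bytes of anonymous
 * memory and ask the kernel not to duplicate the mapping into children
 * created by fork(). Returns NULL on failure.
 */
static void *alloc_dontfork_buffer(size_t len)
{
	void *buf;

	assert(len > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

#ifdef MADV_DONTFORK
	/* Use the header's value; do not assume the 0x30 from this patch. */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
#else
	/* Headers predate the flag: fail rather than return an unprotected buffer. */
	munmap(buf, len);
	errno = ENOSYS;
	return NULL;
#endif
}
```

A buffer obtained this way stays mapped only in the parent after fork(), so a later write by the parent should not trigger copy-on-write on those pages. A child that touches the range would fault, which is why the call is best limited to memory the child never needs, such as DMA buffers.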
Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.16-rc2/mm/madvise.c =================================================================== --- linux-2.6.16-rc2.orig/mm/madvise.c 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/mm/madvise.c 2006-02-14 03:40:22.000000000 +0200 @@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a struct mm_struct * mm = vma->vm_mm; int error = 0; pgoff_t pgoff; - int new_flags = vma->vm_flags & ~VM_READHINTMASK; + int new_flags = vma->vm_flags; switch (behavior) { + case MADV_NORMAL: + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; + break; case MADV_SEQUENTIAL: - new_flags |= VM_SEQ_READ; + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: - new_flags |= VM_RAND_READ; + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; - default: + case MADV_DONTFORK: + new_flags |= VM_DONTCOPY; + break; + case MADV_DOFORK: + new_flags &= ~VM_DONTCOPY; break; } @@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, long error; switch (behavior) { + case MADV_DOFORK: + if (vma->vm_flags & VM_IO) { + error = -EINVAL; + break; + } + case MADV_DONTFORK: case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -37,6 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: 
linux-2.6.16-rc2/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-powerpc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -45,6 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-cris/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-cris/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm26/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-alpha/mman.h 
=================================================================== --- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-alpha/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -43,6 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m68k/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-xtensa/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -73,6 +73,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-mips/mman.h =================================================================== --- 
linux-2.6.16-rc2.orig/include/asm-mips/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-mips/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -66,6 +66,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-v850/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-v850/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -33,6 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-s390/mman.h 2006-02-14 01:22:27.000000000 
+0200 +++ linux-2.6.16-rc2/include/asm-s390/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-parisc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -49,6 +49,8 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-i386/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-i386/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sh/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sh/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 
@@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-ia64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m32r/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* 
discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-frv/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-frv/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-h8300/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define 
MADV_DONTFORK 0x30 /* dont inherit across fork */
+#define MADV_DOFORK 0x31 /* do inherit across fork */
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From hugh at veritas.com Mon Feb 13 15:01:20 2006
From: hugh at veritas.com (Hugh Dickins)
Date: Mon, 13 Feb 2006 23:01:20 +0000 (GMT)
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060213225538.GE13603@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il>
Message-ID:

On Tue, 14 Feb 2006, Michael S. Tsirkin wrote:
> Here's the final version then.

Acked-by: Hugh Dickins

From mst at mellanox.co.il Mon Feb 13 15:35:17 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 14 Feb 2006 01:35:17 +0200
Subject: [openib-general] [PATCH] madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To:
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il>
Message-ID: <20060213233517.GG13603@mellanox.co.il>

Hello, Andrew!
Please consider the following for inclusion in -mm (and hopefully mainline in 2.6.17).
The patch below is against 2.6.16-rc2. Tested on x86_64.

---

Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMAs into this page after the COW. In case of mlock'd memory, the parent is not getting the realtime/security benefits of mlock.

In particular, this affects the InfiniBand modules which do DMA from and into user pages all the time.

This patch adds madvise options to control whether a memory range is inherited across fork. Useful e.g.
for when hardware is doing DMA from/into these pages. Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin Acked-by: Hugh Dickins Index: linux-2.6.16-rc2/mm/madvise.c =================================================================== --- linux-2.6.16-rc2.orig/mm/madvise.c 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/mm/madvise.c 2006-02-14 03:40:22.000000000 +0200 @@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a struct mm_struct * mm = vma->vm_mm; int error = 0; pgoff_t pgoff; - int new_flags = vma->vm_flags & ~VM_READHINTMASK; + int new_flags = vma->vm_flags; switch (behavior) { + case MADV_NORMAL: + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; + break; case MADV_SEQUENTIAL: - new_flags |= VM_SEQ_READ; + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: - new_flags |= VM_RAND_READ; + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; - default: + case MADV_DONTFORK: + new_flags |= VM_DONTCOPY; + break; + case MADV_DOFORK: + new_flags &= ~VM_DONTCOPY; break; } @@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, long error; switch (behavior) { + case MADV_DOFORK: + if (vma->vm_flags & VM_IO) { + error = -EINVAL; + break; + } + case MADV_DONTFORK: case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -37,6 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility 
flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-powerpc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -45,6 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-cris/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-cris/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm26/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-alpha/mman.h 
=================================================================== --- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-alpha/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -43,6 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m68k/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-xtensa/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -73,6 +73,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-mips/mman.h =================================================================== --- 
linux-2.6.16-rc2.orig/include/asm-mips/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-mips/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -66,6 +66,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-v850/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-v850/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -33,6 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-s390/mman.h 2006-02-14 01:22:27.000000000 
+0200 +++ linux-2.6.16-rc2/include/asm-s390/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-parisc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -49,6 +49,8 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-i386/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-i386/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sh/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sh/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 
@@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-ia64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m32r/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* 
discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-frv/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-frv/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-h8300/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define 
MADV_DONTFORK 0x30 /* dont inherit across fork */
+#define MADV_DOFORK 0x31 /* do inherit across fork */
 /* compatibility flags */
 #define MAP_ANON MAP_ANONYMOUS

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From hugh at veritas.com Mon Feb 13 15:44:27 2006
From: hugh at veritas.com (Hugh Dickins)
Date: Mon, 13 Feb 2006 23:44:27 +0000 (GMT)
Subject: [openib-general] [PATCH] madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060213230100.GF13603@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213220947.GD13603@mellanox.co.il> <20060213230100.GF13603@mellanox.co.il>
Message-ID:

On Tue, 14 Feb 2006, Michael S. Tsirkin wrote:
> Here's the final version of MADV_DONTFORK/MADV_DOFORK patch.
> Hugh, I gather you'll forward this to Andrew, correct?

Oh, if you like, here we go - but it would have been perfectly okay for you to send it to him directly yourself. (And he's probably been watching and already taken it anyway.)

It seems to me that it's sufficiently harmless that it could still be 2.6.16 material, but let's leave Linus and Andrew to decide on that.

Hugh

---

From: Michael S. Tsirkin

Currently, copy-on-write may change the physical address of a page even if the user requested that the page is pinned in memory (either by mlock or by get_user_pages). This happens if the process forks meanwhile, and the parent writes to that page. As a result, the page is orphaned: in case of get_user_pages, the application will never see any data hardware DMAs into this page after the COW. In case of mlock'd memory, the parent is not getting the real-time/security benefits of mlock.

In particular, this affects the InfiniBand modules which do DMA from and into user pages all the time.

This patch adds madvise options to control whether a memory range is inherited across fork. Useful e.g. for when hardware is doing DMA from/into these pages.
Could also be useful to an application wanting to speed up its forks by cutting large areas out of consideration. Signed-off-by: Michael S. Tsirkin Signed-off-by: Hugh Dickins Index: linux-2.6.16-rc2/mm/madvise.c =================================================================== --- linux-2.6.16-rc2.orig/mm/madvise.c 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/mm/madvise.c 2006-02-14 03:40:22.000000000 +0200 @@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a struct mm_struct * mm = vma->vm_mm; int error = 0; pgoff_t pgoff; - int new_flags = vma->vm_flags & ~VM_READHINTMASK; + int new_flags = vma->vm_flags; switch (behavior) { + case MADV_NORMAL: + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; + break; case MADV_SEQUENTIAL: - new_flags |= VM_SEQ_READ; + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ; break; case MADV_RANDOM: - new_flags |= VM_RAND_READ; + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ; break; - default: + case MADV_DONTFORK: + new_flags |= VM_DONTCOPY; + break; + case MADV_DOFORK: + new_flags &= ~VM_DONTCOPY; break; } @@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, long error; switch (behavior) { + case MADV_DOFORK: + if (vma->vm_flags & VM_IO) { + error = -EINVAL; + break; + } + case MADV_DONTFORK: case MADV_NORMAL: case MADV_SEQUENTIAL: case MADV_RANDOM: Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -37,6 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: 
linux-2.6.16-rc2/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-powerpc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-powerpc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -45,6 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-cris/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-cris/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm26/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm26/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-alpha/mman.h 
=================================================================== --- linux-2.6.16-rc2.orig/include/asm-alpha/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-alpha/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -43,6 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m68k/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m68k/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-xtensa/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-xtensa/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -73,6 +73,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-mips/mman.h =================================================================== --- 
linux-2.6.16-rc2.orig/include/asm-mips/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-mips/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -66,6 +66,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-v850/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-v850/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -33,6 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-s390/mman.h 2006-02-14 01:22:27.000000000 
+0200 +++ linux-2.6.16-rc2/include/asm-s390/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-parisc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-parisc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -49,6 +49,8 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-i386/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-i386/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sh/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sh/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 
@@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-ia64/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-ia64/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -44,6 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-sparc/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-sparc/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -55,6 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-m32r/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-m32r/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -38,6 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* 
discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-frv/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-frv/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-h8300/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-h8300/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc2/include/asm-arm/mman.h =================================================================== --- linux-2.6.16-rc2.orig/include/asm-arm/mman.h 2006-02-14 01:22:27.000000000 +0200 +++ linux-2.6.16-rc2/include/asm-arm/mman.h 2006-02-14 01:24:57.000000000 +0200 @@ -36,6 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ +#define 
MADV_DONTFORK 0x30 /* dont inherit across fork */ +#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 13 22:30:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Feb 2006 22:30:37 -0800 Subject: [openib-general] [PATCH] mthca - command interface In-Reply-To: (Roland Dreier's message of "Mon, 13 Feb 2006 22:09:28 -0800") References: <1139833171.5814.81.camel@mtls03.yok.mtl.com> Message-ID: > (And looking at the catastrophic error code, I notice that you didn't > request the doorbell region before ioremapping it, which is another > issue to fix) Or is the command doorbell stuff inside the UAR, which we already request in mthca_request_regions()? - R. From gleb at minantech.com Mon Feb 13 22:51:45 2006 From: gleb at minantech.com (Gleb Natapov) Date: Tue, 14 Feb 2006 08:51:45 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060213190206.GC12458@mellanox.co.il> References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> Message-ID: <20060214065145.GE24524@minantech.com> On Mon, Feb 13, 2006 at 09:02:06PM +0200, Michael S. Tsirkin wrote: > Quoting r. Hugh Dickins : > > > Add madvise options to control whether memory range is inherited across fork. > > > Useful e.g. for when hardware is doing DMA from/into these pages. > > > > > > Signed-off-by: Michael S. Tsirkin > > > > Looks good to me, Michael (but Gleb's eye has always proved better than > > mine). Just a couple of adjustments I'd ask before you send to Andrew: > > Gleb has acked this to me in a private mail. > Right, Gleb? > Sory to be late :) Yes the patch is looking good. Acked-by: Gleb Natapov -- Gleb. 
From lyouseff at cs.ucsb.edu Mon Feb 13 23:42:57 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Mon, 13 Feb 2006 23:42:57 -0800 Subject: [openib-general] newbie to openib Message-ID: <43F18A01.1050608@cs.ucsb.edu> Dear list, I am having much trouble on getting openib stack to work. I am always getting the error that "No IB devices found." I am using 2.6.14 kernel for em64t, I am sure that i configured the kernel with infiniband device driver (as portion of .config shows below). I also got all of the modules correctly build and installed. However, I have no clue why it is not working. Any help or hint will be greatly appreciated, -- Lamia -bash-3.0# cat .config ............. # # InfiniBand support # CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_MTHCA=m # CONFIG_INFINIBAND_MTHCA_DEBUG is not set CONFIG_INFINIBAND_IPOIB=m # CONFIG_INFINIBAND_IPOIB_DEBUG is not set ............................ ====================== -bash-3.00# ibv_devinfo libibverbs: Fatal: no infiniband class devices found. No IB devices found ======================= -bash-3.00# opensm ------------------------------------------------- OpenSM Rev:openib-1.1.0 Based on OpenIB svn 5372 Command Line Arguments: Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-1.1.0 OpenIB svn 5372 No local ports detected! Error: Could not get port guid Exiting SM ========================= -bash-3.00# lsmod Module Size Used by ib_ucm 18296 0 ib_cm 33904 1 ib_ucm ib_uverbs 30872 0 ib_umad 15776 0 ib_mthca 93600 0 ib_mad 39208 3 ib_cm,ib_umad,ib_mthca ib_core 44544 5 ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad ========================== -bash-3.00# ls /sys/class/infiniband* -al /sys/class/infiniband: total 0 drwxr-xr-x 2 root root 0 Feb 14 02:11 . drwxr-xr-x 25 root root 0 Feb 14 02:11 .. /sys/class/infiniband_cm: total 0 drwxr-xr-x 3 root root 0 Feb 14 02:11 . drwxr-xr-x 25 root root 0 Feb 14 02:11 .. 
drwxr-xr-x 2 root root 0 Feb 14 02:12 ucm /sys/class/infiniband_mad: total 0 drwxr-xr-x 2 root root 0 Feb 14 02:12 . drwxr-xr-x 25 root root 0 Feb 14 02:11 .. -r--r--r-- 1 root root 4096 Feb 14 02:12 abi_version /sys/class/infiniband_verbs: total 0 drwxr-xr-x 2 root root 0 Feb 14 02:12 . drwxr-xr-x 25 root root 0 Feb 14 02:11 .. -r--r--r-- 1 root root 4096 Feb 14 02:12 abi_version ========================== -bash-3.00# cat /sys/class/infiniband_cm/ucm/dev 231:255 ========================= From mlleinin at hpcn.ca.sandia.gov Mon Feb 13 23:43:07 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 13 Feb 2006 23:43:07 -0800 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: <1139885565.2954.413.camel@sarium.internal.keyresearch.com> References: <1AC79F16F5C5284499BB9591B33D6F0006E16BCF@orsmsx408> <1139885565.2954.413.camel@sarium.internal.keyresearch.com> Message-ID: <1139902987.29855.51.camel@localhost> OpenIB Community, At the OpenIB workshop we discussed putting together an OpenIB code release team. The first task is to work on a 1.0 release of the current development branch of the OpenIB code. We have some volunteers so we now have our 1.0 release team Bryan O'Sullivan has signed up to be the release manager for 1.0, working with Robert Woodruff and Hal Rosenstock. The charter of the release team includes: - determine release mechanics (eg how and when snapshot is taken, bug handling, etc - to be discussed on mailing list) - make sure release meets the needs for RedHat, Novell, and other distros - provide some oversight to the testing matrix - leverage the Q&A resources of the OpenIB industry members - publish what gets tested, at some high level - publish some release criteria for 1.0 features (eg, what makes a feature 1.0 as opposed to Beta) - post release candidates for testing by the wider community - ship the release, hopefully in not too many weeks. 
The release team will be following up to discuss how this will all work. Thanks to all who have volunteered their time for this effort. I'd like to encourage the OpenIB community to download the code release candidates (when they are ready) and test it in your particular computing environments. This is an important step towards getting OpenIB hardened and ready for inclusion into the various Linux distributions. Thanks, - Matt From mst at mellanox.co.il Mon Feb 13 23:48:40 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 09:48:40 +0200 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F18A01.1050608@cs.ucsb.edu> References: <43F18A01.1050608@cs.ucsb.edu> Message-ID: <20060214074840.GE15063@mellanox.co.il> Quoting r. Lamia M.Youseff : > -bash-3.00# ibv_devinfo > libibverbs: Fatal: no infiniband class devices found. > No IB devices found My guess is you dont have the character devices in /dev/infiniband created properly. One way to create them is with udev rules. Look it up in openib wiki. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From lyouseff at cs.ucsb.edu Tue Feb 14 00:04:19 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Tue, 14 Feb 2006 00:04:19 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <20060214074840.GE15063@mellanox.co.il> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> Message-ID: <43F18F03.7040103@cs.ucsb.edu> Hi Michael, Thank you for reply. The way I created the character device was by adding /etc/udev/rules.d/90-ib.rules, unloading the ib kernel modules, reloading them again, and restarting the udev, which is the same way described in the wiki (detailed sequence of commands is shown below). I am not sure what i am doing wrong. Please advise. 
Thank you, Lamia -bash-3.00# uname -a Linux xen2 2.6.14test2 #3 SMP Tue Feb 14 02:01:05 EST 2006 x86_64 x86_64 x86_64 GNU/Linux -bash-3.00# cat /etc/udev/rules.d/90-ib.rules KERNEL="umad*", NAME="infiniband/%k" KERNEL="issm*", NAME="infiniband/%k" KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666" KERNEL="ucm*", NAME="infiniband/%k", MODE="0666" KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666" -bash-3.00# modprobe -r ib_mthca -bash-3.00# modprobe -r ib_umad -bash-3.00# modprobe -r ib_uverbs -bash-3.00# modprobe -r ib_ucm -bash-3.00# modprobe ib_mthca -bash-3.00# modprobe ib_umad -bash-3.00# modprobe ib_uverbs -bash-3.00# modprobe ib_ucm -bash-3.00# udevstart 2>&1 > /dev/null -bash-3.00# ls -al /dev/infiniband/ucm crw-rw-rw- 1 root root 231, 255 Feb 14 02:12 /dev/infiniband/ucm Michael S. Tsirkin wrote: >Quoting r. Lamia M.Youseff : > > >>-bash-3.00# ibv_devinfo >>libibverbs: Fatal: no infiniband class devices found. >>No IB devices found >> >> > >My guess is you dont have the character devices in /dev/infiniband created >properly. One way to create them is with udev rules. Look it up in openib wiki. > > > > From eli at mellanox.co.il Tue Feb 14 00:07:21 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 14 Feb 2006 10:07:21 +0200 Subject: [openib-general] [PATCH] mthca - command interface In-Reply-To: References: <1139833171.5814.81.camel@mtls03.yok.mtl.com> Message-ID: <1139904441.26226.2.camel@mtls03.yok.mtl.com> On Mon, 2006-02-13 at 22:30 -0800, Roland Dreier wrote: > Or is the command doorbell stuff inside the UAR, which we already > request in mthca_request_regions()? Yes. These are offsets into UAR0 so there is no need to request a region. From mst at mellanox.co.il Tue Feb 14 00:20:34 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 14 Feb 2006 10:20:34 +0200 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F18F03.7040103@cs.ucsb.edu> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> Message-ID: <20060214082034.GB10494@mellanox.co.il> Quoting r. Lamia M.Youseff : > Subject: Re: newbie to openib > > Hi Michael, > Thank you for reply. The way I created the character device was by > adding /etc/udev/rules.d/90-ib.rules, unloading the ib kernel modules, > reloading them again, and restarting the udev, which is the same way > described in the wiki (detailed sequence of commands is shown below). I > am not sure what i am doing wrong. Please advise. Do you have /dev/infiniband*? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From lyouseff at cs.ucsb.edu Tue Feb 14 00:34:34 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Tue, 14 Feb 2006 00:34:34 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <20060214082034.GB10494@mellanox.co.il> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> Message-ID: <43F1961A.2000100@cs.ucsb.edu> >>Hi Michael, >>Thank you for reply. The way I created the character device was by >>adding /etc/udev/rules.d/90-ib.rules, unloading the ib kernel modules, >>reloading them again, and restarting the udev, which is the same way >>described in the wiki (detailed sequence of commands is shown below). I >>am not sure what i am doing wrong. Please advise. >> >> > >Do you have /dev/infiniband*? > > > Just one directory: /dev/infiniband with ucm device inside it. -- .......................................................... 
: Lamia M.Youseff : lyouseff at cs.ucsb.edu : : Ph.D Candidate : www.cs.ucsb.edu/~lyouseff : : University of California,: : : Santa Barbara (UCSB) : : :........................................................: : : : "There are no impossible dreams; there is just our : : limited perception of what is possible" : *:-.,_,.-:*'``'*:-.,_,.-:*'``'*:-.,_,.-:*'``'*:-.,_,.-:*'* From mst at mellanox.co.il Tue Feb 14 00:39:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 10:39:09 +0200 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F1961A.2000100@cs.ucsb.edu> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> Message-ID: <20060214083909.GC10494@mellanox.co.il> Quoting r. Lamia M.Youseff : > Subject: Re: newbie to openib > > > >>Hi Michael, > >>Thank you for reply. The way I created the character device was by > >>adding /etc/udev/rules.d/90-ib.rules, unloading the ib kernel modules, > >>reloading them again, and restarting the udev, which is the same way > >>described in the wiki (detailed sequence of commands is shown below). I > >>am not sure what i am doing wrong. Please advise. > > > >Do you have /dev/infiniband*? > > Just one directory: /dev/infiniband with ucm device inside it. This means that the problem is with the udev: it didnt create the devices properly. I dont know how to debug this. Maybe updating to latest udev will help. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From dotanb at mellanox.co.il Tue Feb 14 00:57:09 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 14 Feb 2006 10:57:09 +0200 Subject: [openib-general] RE: query_qp and query_srq (and libibverbs 1.0) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD952@mtlexch01.mtl.com> Hi Roland. 
The following patch wasn't committed: Index: last_stable/drivers/infiniband/core/uverbs_main.c =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs_main.c 2006-02-07 21:13:38.000000000 +0200 +++ last_stable/drivers/infiniband/core/uverbs_main.c 2006-02-08 09:56:39.000000000 +0200 @@ -107,6 +107,7 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast, [IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq, [IB_USER_VERBS_CMD_MODIFY_SRQ] = ib_uverbs_modify_srq, + [IB_USER_VERBS_CMD_QUERY_SRQ] = ib_uverbs_query_srq, [IB_USER_VERBS_CMD_DESTROY_SRQ] = ib_uverbs_destroy_srq, }; thanks Dotan From lyouseff at cs.ucsb.edu Tue Feb 14 01:11:16 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Tue, 14 Feb 2006 01:11:16 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <20060214083909.GC10494@mellanox.co.il> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> Message-ID: <43F19EB4.50909@cs.ucsb.edu> >This means that the problem is with the udev: it didnt create the devices >properly. I dont know how to debug this. Maybe updating to latest udev will >help. > It does not seem it is a udev problem. I have upgraded from udev-039-10.10.EL4 to udev-071-0.FC4.2, unloaded, reloaded the modules and then restarted udev with no progress to the output of ibv_devinfo or any chances to devices created in /dev/infiniband. I am wondering if this error may be caused by absence/malfunction of the physical card, since this is a remote machine and I can not see it. If so, How can I check that card is pluged-in and working properly. I have included strace of devinfo as requested by the list.!! 
-bash-3.00# strace ibv_devinfo execve("/usr/local/bin/ibv_devinfo", ["ibv_devinfo"], [/* 17 vars */]) = 0 uname({sys="Linux", node="xen2", ...}) = 0 brk(0) = 0x503000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaab000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/usr/local/lib/tls/x86_64/libibverbs.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) stat("/usr/local/lib/tls/x86_64", 0x7fffffb00ee0) = -1 ENOENT (No such file or directory) open("/usr/local/lib/tls/libibverbs.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) stat("/usr/local/lib/tls", 0x7fffffb00ee0) = -1 ENOENT (No such file or directory) open("/usr/local/lib/x86_64/libibverbs.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) stat("/usr/local/lib/x86_64", 0x7fffffb00ee0) = -1 ENOENT (No such file or directory) open("/usr/local/lib/libibverbs.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@ \0\0\0"..., 640) = 640 fstat(3, {st_mode=S_IFREG|0755, st_size=130448, ...}) = 0 mmap(NULL, 1074848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2aaaaaaac000 mprotect(0x2aaaaaab3000, 1046176, PROT_NONE) = 0 mmap(0x2aaaaabb2000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x2aaaaabb2000 close(3) = 0 open("/usr/local/lib/libsysfs.so.1", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=60550, ...}) = 0 mmap(NULL, 60550, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2aaaaabb3000 close(3) = 0 open("/usr/lib64/libsysfs.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200/\340"..., 640) = 640 fstat(3, {st_mode=S_IFREG|0755, st_size=51776, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaabc2000 mmap(0x3b70e00000, 1096328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3b70e00000 mprotect(0x3b70e0c000, 
1047176, PROT_NONE) = 0 mmap(0x3b70f0b000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb000) = 0x3b70f0b000 close(3) = 0 open("/usr/local/lib/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/tls/libpthread.so.0", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340U\200"..., 640) = 640 fstat(3, {st_mode=S_IFREG|0755, st_size=106105, ...}) = 0 mmap(0x3b71800000, 1131384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3b71800000 mprotect(0x3b71810000, 1065848, PROT_NONE) = 0 mmap(0x3b7190f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x3b7190f000 mmap(0x3b71911000, 13176, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3b71911000 close(3) = 0 open("/usr/local/lib/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/libdl.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\17\0"..., 640) = 640 fstat(3, {st_mode=S_IFREG|0755, st_size=17943, ...}) = 0 mmap(0x3b71000000, 1056968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3b71000000 mprotect(0x3b71002000, 1048776, PROT_NONE) = 0 mmap(0x3b71101000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x3b71101000 close(3) = 0 open("/usr/local/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/tls/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\305\261"..., 640) = 640 lseek(3, 624, SEEK_SET) = 624 read(3, "\4\0\0\0\20\0\0\0\1\0\0\0GNU\0\0\0\0\0\2\0\0\0\4\0\0\0"..., 32) = 32 fstat(3, {st_mode=S_IFREG|0755, st_size=1489097, ...}) = 0 mmap(0x3b70b00000, 2305992, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3b70b00000 mprotect(0x3b70c2a000, 1085384, PROT_NONE) = 0 mmap(0x3b70d29000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x129000) = 0x3b70d29000 mmap(0x3b70d2f000, 16328, 
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3b70d2f000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaabc3000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaabc4000 mprotect(0x3b70d29000, 12288, PROT_READ) = 0 arch_prctl(ARCH_SET_FS, 0x2aaaaabc3fc0) = 0 munmap(0x2aaaaabb3000, 60550) = 0 set_tid_address(0x2aaaaabc4050) = 6935 rt_sigaction(SIGRTMIN, {0x3b71805190, [], SA_RESTORER|SA_SIGINFO, 0x3b7180c320}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {0x3b71805210, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3b7180c320}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0 _sysctl({{CTL_KERN, KERN_VERSION, 0, 0, 0, 0, 20bd1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 2, 0x7fffffb01670, 35, (nil), 0}) = 0 brk(0) = 0x503000 brk(0x524000) = 0x524000 open("/usr/local/lib/infiniband", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 5 entries */, 4096) = 144 getdents64(3, /* 0 entries */, 4096) = 0 close(3) = 0 futex(0x3b711020c4, FUTEX_WAKE, 2147483647) = 0 open("/usr/local/lib/infiniband/mthca.so", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\26\0"..., 640) = 640 fstat(3, {st_mode=S_IFREG|0755, st_size=164941, ...}) = 0 mmap(NULL, 1075376, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2aaaaabc5000 mprotect(0x2aaaaabcc000, 1046704, PROT_NONE) = 0 mmap(0x2aaaaaccb000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x2aaaaaccb000 close(3) = 0 getuid() = 0 geteuid() = 0 open("/proc/mounts", O_RDONLY) = 3 futex(0x3b70d30828, FUTEX_WAKE, 2147483647) = 0 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaccc000 
read(3, "rootfs / rootfs rw 0 0\n/proc /pr"..., 1024) = 515 close(3) = 0 munmap(0x2aaaaaccc000, 4096) = 0 lstat("/sys/class/infiniband_verbs", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 open("/proc/mounts", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaccc000 read(3, "rootfs / rootfs rw 0 0\n/proc /pr"..., 1024) = 515 close(3) = 0 munmap(0x2aaaaaccc000, 4096) = 0 stat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 3 read(3, "2\n", 4096) = 2 close(3) = 0 lstat("/sys/class/infiniband_verbs", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 3 entries */, 4096) = 80 lstat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 getdents64(3, /* 0 entries */, 4096) = 0 close(3) = 0 write(2, "libibverbs: Fatal: no infiniband"..., 54libibverbs: Fatal: no infiniband class devices found. ) = 54 write(2, "No IB devices found\n", 20No IB devices found ) = 20 exit_group(-1) = ? Process 6935 detached -- .......................................................... : Lamia M.Youseff : lyouseff at cs.ucsb.edu : : Ph.D Candidate : www.cs.ucsb.edu/~lyouseff : : University of California,: : : Santa Barbara (UCSB) : : :........................................................: : : : "There are no impossible dreams; there is just our : : limited perception of what is possible" : *:-.,_,.-:*'``'*:-.,_,.-:*'``'*:-.,_,.-:*'``'*:-.,_,.-:*'* From mst at mellanox.co.il Tue Feb 14 01:16:06 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 14 Feb 2006 11:16:06 +0200 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F19EB4.50909@cs.ucsb.edu> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> Message-ID: <20060214091606.GA12170@mellanox.co.il> Quoting Lamia M.Youseff : > I am wondering if this error may be caused by absence/malfunction of the > physical card, since this is a remote machine and I can not see it. If > so, how can I check that the card is plugged in and working properly? > I have included strace of devinfo as requested by the list. Correct, I missed that info in your post: the fact that you don't have the devices under /sys/class/infiniband implies some hardware problem. Do you see any messages from MTHCA in dmesg or /var/log/messages? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From lyouseff at cs.ucsb.edu Tue Feb 14 01:32:56 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Tue, 14 Feb 2006 01:32:56 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <20060214091606.GA12170@mellanox.co.il> References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> <20060214091606.GA12170@mellanox.co.il> Message-ID: <43F1A3C8.2070106@cs.ucsb.edu> >Correct, I missed that info in your post: the fact that you don't >have the devices under /sys/class/infiniband implies some hardware problem. > >Do you see any messages from MTHCA in dmesg or /var/log/messages? > > > There is no trace of mthca, MTHCA, pci-x, pci-express or hca in dmesg or /var/log/messages. I guess this would mean a hardware problem (possibly the card is not present). Am I correct in my assumption? 
Thanks Michael, appreciate your help. --Lamia From tziporet at mellanox.co.il Tue Feb 14 04:36:45 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 14 Feb 2006 14:36:45 +0200 Subject: [openib-general] Mellanox contributes Linux srp target (gen1 code base) Message-ID: <43F1CEDD.9090200@mellanox.co.il> In openib workshop Mellanox decided to open source our gen1 SRP target. This implements an SRP target in software on top of a local SCSI device supported by Linux. Code is posted under: https://openib.org/svn/trunk/contrib/mellanox/gen1/ib_srpt This SRP target passed basic tests with gen2 SRP initiator. We currently don't have plans to port the code to gen2 but we welcome any contribution of the community. Tziporet 
From tziporet at mellanox.co.il Tue Feb 14 05:12:52 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 14 Feb 2006 15:12:52 +0200 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: <1139902987.29855.51.camel@localhost> References: <1AC79F16F5C5284499BB9591B33D6F0006E16BCF@orsmsx408> <1139885565.2954.413.camel@sarium.internal.keyresearch.com> <1139902987.29855.51.camel@localhost> Message-ID: <43F1D754.2020807@mellanox.co.il> Hi Matt, Good that we start the release effort. I would like to join the release team as Mellanox representative. Tziporet From ianjiang.ict at gmail.com Tue Feb 14 05:27:37 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 14 Feb 2006 21:27:37 +0800 Subject: [openib-general] Which driver to choose, OpenIB or IBGD? Message-ID: <7b2fa1820602140527s7e0a78b9ld8f85b6576ea12e7@mail.gmail.com> Hi all, We are going to develop some applications over the InfiniBand network. 
And I would like to get some advice on choosing the IB driver: IBGD or OpenIB. These applications are of certain scale and the developing will last for a certain time. So a stable and reliable environment is needed. As a software collection provided by Mellanox, IBGD-1.8.0 is a good choice.The open source at OpenIB is developed all the time and the linux kernel that the latest IB driver need is changed every now and then. That is what we do not like very much. However, I am wondering if any new feature is or will be provided in the OpenIB driver, as it is developing all the time. Our applications might benefit a lot if so. And could I still turn to this mail-list for help when coming with problems if we were using the IBGD VAPI? Another is about the difference between the VAPIs in IBGD and OpenIB in a programmer's view. The two are of difference not just in name, I'm afraid. All I could think out is above and any suggestion is appreciated! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Tue Feb 14 05:47:41 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 14 Feb 2006 15:47:41 +0200 Subject: [openib-general] Which driver to choose, OpenIB or IBGD? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CDA95@mtlexch01.mtl.com> Hi Ian and welcome to the IB scene ... 
There are several things you should consider when you choose the stack to work with:

Life
-----
* The IBGD is based on the VAPI driver (which won't be developed much in the future)
* The openIB is under development

Stability
-----------
* Both of the drivers are stable

Support
------------
* IBGD: Mellanox support will answer your questions
* openIB: all the openIB members can help you and answer your questions

Debug features
---------------------
* IBGD: a more programmer-friendly environment (the return value helps you understand exactly what the error was)
* openIB: not friendly at all: the return value in case of error will be -1 (most of the time), without any debug print or a specific return value for different errors

Supported HW
--------------------
* IBGD: Mellanox HCAs are supported
* openIB: Mellanox / IBM / PathScale HCAs are supported

Supported Linux kernel versions
------------------------------------------------
* IBGD supports kernels 2.4.*, 2.6* (up to 2.6.11)
* openIB supports the latest kernel.org kernel

Latency
-----------
* openIB based applications may get better latency than IBGD based applications

A few weeks ago, Mellanox published IBG2 (which is a stable version of the openIB driver)

Dotan
-------------- next part -------------- An HTML attachment was scrubbed... URL: From Arkady.Kanevsky at netapp.com Tue Feb 14 06:35:01 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 14 Feb 2006 09:35:01 -0500 Subject: [openib-general] cache.c Message-ID: Roland, in core/cache.c should

    device->cache.gid_cache = kmalloc(sizeof *device->cache.pkey_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL);

be

    device->cache.gid_cache = kmalloc(sizeof *device->cache.gid_cache * (end_port(device) - start_port(device) + 1), GFP_KERNEL);

Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Feb 14 06:40:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 06:40:51 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F1A3C8.2070106@cs.ucsb.edu> (Lamia M. Youseff's message of "Tue, 14 Feb 2006 01:32:56 -0800") References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> <20060214091606.GA12170@mellanox.co.il> <43F1A3C8.2070106@cs.ucsb.edu> Message-ID: Lamia> There is no trace for mthca, MTHCA, pci-x, pci-express or Lamia> hca in dmsg or /var/log/messages. I guess this would mean a Lamia> hardware problem (possibly the card is not present), Am i Lamia> correct in my assumption? What does lspci -vv show? Is the ib_mthca module loaded? BTW, when debugging problems, you should probably set CONFIG_INFINIBAND_MTHCA_DEBUG=y. - R. From rdreier at cisco.com Tue Feb 14 06:44:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 06:44:08 -0800 Subject: [openib-general] cache.c In-Reply-To: (Arkady Kanevsky's message of "Tue, 14 Feb 2006 09:35:01 -0500") References: Message-ID: Arkady> Roland, in core/cache.c Arkady> should device-> cache.gid_cache = Arkady> kmalloc(sizeof *device->cache.pkey_cache * Arkady> (end_port(device) - start_port(device) + 1), GFP_KERNEL); Arkady> be device-> cache.gid_cache = Arkady> kmalloc(sizeof *device->cache.gid_cache * Arkady> (end_port(device) - start_port(device) + 1), GFP_KERNEL); Yes, I guess so. It makes no practical difference since all pointers are always going to be the same size, but we might as well get it right. Care to send a patch? 
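The point in the reply above can be illustrated with a minimal user-space sketch. The struct names (gid_table, pkey_table, dev_cache) and the helper functions are hypothetical stand-ins, not the kernel's definitions, and calloc is used in place of kmalloc. Since both caches are arrays of pointers, the mismatched and the matched sizeof expressions ask for the same number of bytes; writing sizeof *c->gid_cache only keeps the allocation tied to the pointer that is actually being assigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-port tables (not the kernel structs). */
struct gid_table  { int table_len; };
struct pkey_table { int table_len; };

/* Each cache is an array holding one table pointer per port. */
struct dev_cache {
    struct gid_table  **gid_cache;
    struct pkey_table **pkey_cache;
};

/* C requires all pointers to struct types to share one representation,
 * so the element sizes of the two arrays are equal. */
static_assert(sizeof(struct gid_table *) == sizeof(struct pkey_table *),
              "struct pointers have the same size");

/* Byte count requested by the expression as originally written. */
size_t gid_bytes_mismatched(const struct dev_cache *c, size_t nports)
{
    return sizeof *c->pkey_cache * nports;
}

/* Byte count requested by the corrected expression. */
size_t gid_bytes_matched(const struct dev_cache *c, size_t nports)
{
    return sizeof *c->gid_cache * nports;
}

/* Allocate the per-port pointer array, sized from the pointer being set.
 * Returns 0 on success, -1 on allocation failure. */
int alloc_gid_cache(struct dev_cache *c, size_t nports)
{
    c->gid_cache = calloc(nports, sizeof *c->gid_cache);
    return c->gid_cache ? 0 : -1;
}
```

Sizing an allocation from the destination pointer (sizeof *ptr) means the expression stays correct if the element type ever changes, which is the reason to fix the line even though it changes nothing today.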
Thanks, Roland From rdreier at cisco.com Tue Feb 14 06:50:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 06:50:09 -0800 Subject: [openib-general] Re: query_qp and query_srq (and libibverbs 1.0) In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD952@mtlexch01.mtl.com> (Dotan Barak's message of "Tue, 14 Feb 2006 10:57:09 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CD952@mtlexch01.mtl.com> Message-ID: Your patch got line-wrapped :( Fortunately it was trivial enough to apply by hand. Thanks, Roland From ianjiang.ict at gmail.com Tue Feb 14 06:54:05 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 14 Feb 2006 22:54:05 +0800 Subject: [openib-general] Which driver to choose, OpenIB or IBGD? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CDA95@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CDA95@mtlexch01.mtl.com> Message-ID: <7b2fa1820602140654h1eeee2bcod1e5f8c15f468443@mail.gmail.com> Hi Dotan, Thanks for your reply!

On 2/14/06, Dotan Barak wrote:
> Supported Linux kernels versions
>
> Latency
> -----------
> - openIB based applications may get better latency than IBGD based applications

Has anybody run an experiment comparing the two?

> A few weeks ago, Mellanox published IBG2 (which is a stable version of the
> openIB driver)

Do you know which version of the OpenIB driver IBG2 is based on?

Another question: does it matter which driver is used for communication between IB cards or switches? Could an IB card driven by IBGD/IBG2 communicate correctly with one driven by OpenIB?

Thanks again and best wishes!
-- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed...
URL: From monil at voltaire.com Tue Feb 14 06:55:46 2006 From: monil at voltaire.com (Moni Levy) Date: Tue, 14 Feb 2006 16:55:46 +0200 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: <43F1D754.2020807@mellanox.co.il> References: <1AC79F16F5C5284499BB9591B33D6F0006E16BCF@orsmsx408> <1139885565.2954.413.camel@sarium.internal.keyresearch.com> <1139902987.29855.51.camel@localhost> <43F1D754.2020807@mellanox.co.il> Message-ID: <6a122cc00602140655t3733e797hd3e810ae6d0efcd2@mail.gmail.com> Hi Matt, I would be happy to join the release team as additional Voltaire representative. Moni Levy | +972-971-7670(o) Project Manager, Mainstream IB host stack Voltaire – The Grid Backbone http://www.voltaire.com/ On 2/14/06, Tziporet Koren wrote: > Hi Matt, > > Good that we start the release effort. I would like to join the release > team as Mellanox representative. > > Tziporet > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue Feb 14 07:00:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 07:00:48 -0800 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: <6a122cc00602140655t3733e797hd3e810ae6d0efcd2@mail.gmail.com> (Moni Levy's message of "Tue, 14 Feb 2006 16:55:46 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0006E16BCF@orsmsx408> <1139885565.2954.413.camel@sarium.internal.keyresearch.com> <1139902987.29855.51.camel@localhost> <43F1D754.2020807@mellanox.co.il> <6a122cc00602140655t3733e797hd3e810ae6d0efcd2@mail.gmail.com> Message-ID: It's great that so many people want to help with the release, but the whole point of having a release team was to have a small number of people that can move the release rapidly. I think that the three members of the team is the maximum number already. 
Please give the release team time to identify places where they need help. Once they have had a chance to begin working, I'm sure there will be ample opportunity to volunteer. - R. From mst at mellanox.co.il Tue Feb 14 08:36:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 18:36:29 +0200 Subject: [openib-general] mthca: switching between MSI-X and MSI mode Message-ID: <20060214163629.GA12974@mellanox.co.il> Hi! I am trying to enable MSI-X, then disable and enable MSI. This fails. ~>modprobe ib_mthca msi_x=1 ~>rmmod ib_mthca Works fine ~>modprobe ib_mthca msi=1 The second modprobe fails to get interrupts from the device. An attempt to fall back to regular interrupts fails as well: ib_mthca 0000:04:00.0: NOP command failed to generate interrupt (IRQ 201), aborting. ib_mthca 0000:04:00.0: Try again with MSI/MSI-X disabled. ib_mthca 0000:04:00.0: Clearing mask 00000000001f43fe for eqn 2 ib_mthca 0000:04:00.0: Clearing mask 0000000000000400 for eqn 3 ACPI: PCI interrupt for device 0000:04:00.0 disabled ib_mthca: probe of 0000:04:00.0 failed with error -16 Looks like a linux bug, does it not? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 14 08:42:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Feb 2006 18:42:36 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060214065145.GE24524@minantech.com> References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> Message-ID: <20060214164236.GB12974@mellanox.co.il> All, MADV_DONTFORK patch is now part of the -mm tree. Everyone who's interested in fork support, please test 2.6.16-rc3-mm1 and publish the results here and on lkml. Please make sure to Cc me on results: I'm not subscribed to lkml. Thanks, -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From Arkady.Kanevsky at netapp.com Tue Feb 14 08:45:08 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 14 Feb 2006 11:45:08 -0500 Subject: [openib-general] IBTA IP Addressing Annex Message-ID: incorporates fixes based on Ted and Mike's comments. Major change is ARI suggested value length is present in byte 2 and suggested value starts at byte 4 of ARI which is align to 16-byte boundary. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ip_address_annex_v6.pdf Type: application/octet-stream Size: 69930 bytes Desc: ip_address_annex_v6.pdf URL: From mlleinin at hpcn.ca.sandia.gov Tue Feb 14 09:24:54 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Tue, 14 Feb 2006 09:24:54 -0800 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: References: <1AC79F16F5C5284499BB9591B33D6F0006E16BCF@orsmsx408> <1139885565.2954.413.camel@sarium.internal.keyresearch.com> <1139902987.29855.51.camel@localhost> <43F1D754.2020807@mellanox.co.il> <6a122cc00602140655t3733e797hd3e810ae6d0efcd2@mail.gmail.com> Message-ID: <1139937894.29855.87.camel@localhost> On Tue, 2006-02-14 at 07:00 -0800, Roland Dreier wrote: > It's great that so many people want to help with the release, but the > whole point of having a release team was to have a small number of > people that can move the release rapidly. I think that the three > members of the team is the maximum number already. > > Please give the release team time to identify places where they need > help. Once they have had a chance to begin working, I'm sure there > will be ample opportunity to volunteer.
Limiting the "release team" to three people is an attempt to keep this process manageable. However, for the OpenIB code release process to work well we will need to take advantage of the Q&A resources at several companies (as well as the wider community). To first order I'd like to see Mellanox, Cisco, Voltaire, and SilverStorm, working with the OpenIB release team, to take advantage of each companies Q&A process. It's good that Moni (Voltaire) and Tziporet (Mellanox) would like to be the points of contact for their companies working with the release team. Would anyone from Cisco and SilverStorm be willing to interface between your internal Q&A process and the OpenIB release team? Thanks, - Matt From robert.j.woodruff at intel.com Tue Feb 14 09:48:50 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 14 Feb 2006 09:48:50 -0800 Subject: [openib-general] We have an OpenIB code release team In-Reply-To: <1139937894.29855.87.camel@localhost> Message-ID: <000001c6318e$ea9afa20$98cc180a@amr.corp.intel.com> Matt wrote, > Limiting the "release team" to three people is an attempt to keep this > process manageable. However, for the OpenIB code release process to > work well we will need to take advantage of the Q&A resources at several > companies (as well as the wider community). To first order I'd like to > see Mellanox, Cisco, Voltaire, and SilverStorm, working with the OpenIB > release team, to take advantage of each companies Q&A process. It's > good that Moni (Voltaire) and Tziporet (Mellanox) would like to be the > points of contact for their companies working with the release team. > Would anyone from Cisco and SilverStorm be willing to interface > between your internal Q&A process and the OpenIB release team? > Thanks, > - Matt We appreciate all of the people that want to help with this and I am sure that we will be able to utilize the resources of all of the companies, especially their expertise in the testing and QA phase of the release. 
woody From sweitzen at cisco.com Tue Feb 14 09:51:09 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 14 Feb 2006 09:51:09 -0800 Subject: [openib-general] We have an OpenIB code release team Message-ID: I will be the Cisco SQA interface. Scott Weitzenkamp SQA Manager Cisco Systems -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Matt Leininger Sent: Tuesday, February 14, 2006 9:25 AM To: Roland Dreier (rdreier) Cc: Robert J Woodruff; openib-general at openib.org Subject: Re: [openib-general] We have an OpenIB code release team On Tue, 2006-02-14 at 07:00 -0800, Roland Dreier wrote: > It's great that so many people want to help with the release, but the > whole point of having a release team was to have a small number of > people that can move the release rapidly. I think that the three > members of the team is the maximum number already. > > Please give the release team time to identify places where they need > help. Once they have had a chance to begin working, I'm sure there > will be ample opportunity to volunteer. Limiting the "release team" to three people is an attempt to keep this process manageable. However, for the OpenIB code release process to work well we will need to take advantage of the Q&A resources at several companies (as well as the wider community). To first order I'd like to see Mellanox, Cisco, Voltaire, and SilverStorm, working with the OpenIB release team, to take advantage of each companies Q&A process. It's good that Moni (Voltaire) and Tziporet (Mellanox) would like to be the points of contact for their companies working with the release team. Would anyone from Cisco and SilverStorm be willing to interface between your internal Q&A process and the OpenIB release team? 
Thanks, - Matt _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kschoche at scl.ameslab.gov Tue Feb 14 10:08:04 2006 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Tue, 14 Feb 2006 12:08:04 -0600 Subject: [openib-general] libmthca error codes question Message-ID: <43F21C84.6060200@scl.ameslab.gov> I have come across some error codes which are not entirely documented (or at least very well hidden).. They are generated when I try to send via rdma using uverbs interface on mthca drivers. When polling for work completions, I get the following behavior, and have not been able to figure out if they are related, and/or what they actually mean. ( I'm polling using ibv_poll_cq() inside a loop) I receive these two error values/codes for the ibv_wc.status variable: IBV_WC_LOC_PROT_ERR IBV_WC_WR_FLUSH_ERR (after the first one, all of the status fields are this) Is the LOC_PROT_ERR being generated due to some wrong access flags for the queue pair upon initialization? I can send/recv fine on the same queue pair when not using rdma, so I'm somewhat confused about why I'd not get this error elsewhere. what causes the two error codes? thanks, - Kyle -- Kyle Schochenmaier kschoche at scl.ameslab.gov Research Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory From rdreier at cisco.com Tue Feb 14 10:19:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 10:19:20 -0800 Subject: [openib-general] libmthca error codes question In-Reply-To: <43F21C84.6060200@scl.ameslab.gov> (Kyle Schochenmaier's message of "Tue, 14 Feb 2006 12:08:04 -0600") References: <43F21C84.6060200@scl.ameslab.gov> Message-ID: Kyle> I have come across some error codes which are not entirely Kyle> documented (or at least very well hidden).. 
They are Kyle> generated when I try to send via rdma using uverbs interface Kyle> on mthca drivers. They are documented in the IB spec. See section 11.6.2 of volume 1. IBV_WC_LOC_PROT_ERR == "Local Protection Error" and IBV_WC_WR_FLUSH_ERR == "Work Request Flushed Error" Kyle> ( I'm polling using ibv_poll_cq() inside a loop) I receive Kyle> these two error values/codes for the ibv_wc.status variable: Kyle> IBV_WC_LOC_PROT_ERR IBV_WC_WR_FLUSH_ERR (after the first Kyle> one, all of the status fields are this) Kyle> Is the LOC_PROT_ERR being generated due to some wrong access Kyle> flags for the queue pair upon initialization? I can Kyle> send/recv fine on the same queue pair when not using rdma, Kyle> so I'm somewhat confused about why I'd not get this error Kyle> elsewhere. Local protection error means that your work request used an address/L_Key that it does not have access to. If you post your code then we might be able to help you document it further. Once you have one work request complete with an error, the QP will transition to the error state and all further requests that are queued will be completed with the flush error. - R. From ostampflee at terrasoftsolutions.com Tue Feb 14 11:26:54 2006 From: ostampflee at terrasoftsolutions.com (Owen Stampflee) Date: Tue, 14 Feb 2006 11:26:54 -0800 Subject: [openib-general] OpenSM realloc error Message-ID: <1139945214.2169.4.camel@beast.terraplex.com> [root at m1 ~]# opensm ------------------------------------------------- OpenSM Rev:openib-1.1.0 Based on OpenIB svn 5411 Command Line Arguments: Log File: /var/log/osm.log ------------------------------------------------- *** glibc detected *** realloc(): invalid next size: 0x000000001007ae90 *** Aborted Installed components are: * kernel-g5-smp-2.6.15-1.yhpc.1 * gcc-3.4.4-2.ydl.2 * glibc-2.3.4-2.13.ydl.0 * SBS mthca adapter Everything is built 64-bit, with -fPIC. Any idea on what could be going on? 
Thanks, Owen From Arkady.Kanevsky at netapp.com Tue Feb 14 11:38:38 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 14 Feb 2006 14:38:38 -0500 Subject: [openib-general] IBTA IP addressing Message-ID: merged ARI tables together. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ip_address_annex_v7.pdf Type: application/octet-stream Size: 71585 bytes Desc: ip_address_annex_v7.pdf URL: From lyouseff at cs.ucsb.edu Tue Feb 14 12:15:07 2006 From: lyouseff at cs.ucsb.edu (Lamia M.Youseff) Date: Tue, 14 Feb 2006 12:15:07 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: References: <43F18A01.1050608@cs.ucsb.edu> <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> <20060214091606.GA12170@mellanox.co.il> <43F1A3C8.2070106@cs.ucsb.edu> Message-ID: <43F23A4B.6040101@cs.ucsb.edu> >What does lspci -vv show? > >Is the ib_mthca module loaded? > >BTW, when debugging problems, you should probably set >CONFIG_INFINIBAND_MTHCA_DEBUG=y. > > - R. > > Thanks Roland for your reply. I recompiled the kernel with InfiniBand debugging enabled. After making sure the ib_mthca module is loaded, I tried ibstatus. It seems that the tool is looking for some file in /sys/class/infiniband/*/ports. However, this infiniband directory is empty. The output of lspci -vv shows that a PCI Express card is there, which I believe is the InfiniBand card. However, there is nothing about mthca or hca. I have included the output of those commands below. This is getting even more confusing for me.
-- Lamia -bash-3.00# lsmod Module Size Used by ib_ucm 18296 0 ib_cm 33904 1 ib_ucm ib_uverbs 30872 0 ib_umad 15776 0 ib_mthca 98848 0 ib_mad 39208 3 ib_cm,ib_umad,ib_mthca ib_core 44544 5 ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad -bash-3.00# ibstatus Fatal error: device '*': sys files not found (/sys/class/infiniband/*/ports) -bash-3.00# ibv_devinfo libibverbs: Fatal: no infiniband class devices found. No IB devices found -bash-3.00# lspci 00:00.0 Host bridge: Intel Corporation E7525 Memory Controller Hub (rev 0a) 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0a) 00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 (rev 0a) 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0a) 00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) 00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) 00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02) 00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02) 00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) 00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02) 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02) 00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02) 02:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) 02:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09) 02:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) 02:00.3 PIC: Intel Corporation 6700PXH I/OxAPIC Interrupt 
Controller B (rev 09) 04:02.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04) -bash-3.00# lspci -v 00:00.0 Host bridge: Intel Corporation E7525 Memory Controller Hub (rev 0a) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, fast devsel, latency 0 Capabilities: [40] Vendor Specific Information 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0a) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 Capabilities: [50] Power Management version 2 Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable- Capabilities: [64] Express Root Port (Slot-) IRQ 0 00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 (rev 0a) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=02, subordinate=04, sec-latency=0 I/O behind bridge: 00002000-00002fff Memory behind bridge: dfb00000-dfcfffff Capabilities: [50] Power Management version 2 Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable- Capabilities: [64] Express Root Port (Slot-) IRQ 0 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0a) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=05, subordinate=05, sec-latency=0 Capabilities: [50] Power Management version 2 Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable- Capabilities: [64] Express Root Port (Slot-) IRQ 0 00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, medium devsel, latency 0, IRQ 145 I/O ports at 1840 [size=32] 00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer 
Inc: Unknown device 5680 Flags: bus master, medium devsel, latency 0, IRQ 169 I/O ports at 1860 [size=32] 00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, medium devsel, latency 0, IRQ 161 I/O ports at 1880 [size=32] 00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02) (prog-if 00 [UHCI]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, medium devsel, latency 0, IRQ 145 I/O ports at 18a0 [size=32] 00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) (prog-if 20 [EHCI]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, medium devsel, latency 0, IRQ 177 Memory at dfa00000 (32-bit, non-prefetchable) [size=1K] Capabilities: [50] Power Management version 2 Capabilities: [58] Debug port 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=06, subordinate=06, sec-latency=32 00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02) Flags: bus master, medium devsel, latency 0 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02) (prog-if 8a [Master SecP PriP]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, 66Mhz, medium devsel, latency 0, IRQ 161 I/O ports at I/O ports at I/O ports at I/O ports at I/O ports at 18f0 [size=16] 00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: medium devsel, IRQ 153 I/O ports at 1100 [size=32] 00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus 
master, medium devsel, latency 0, IRQ 153 I/O ports at 1400 [size=256] I/O ports at 1800 [size=64] Memory at dfa00c00 (32-bit, non-prefetchable) [size=512] Memory at dfa00800 (32-bit, non-prefetchable) [size=256] Capabilities: [50] Power Management version 2 02:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=02, secondary=03, subordinate=03, sec-latency=64 Capabilities: [44] Express PCI/PCI-X Bridge IRQ 0 Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Capabilities: [6c] Power Management version 2 Capabilities: [d8] PCI-X bridge device. 02:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09) (prog-if 20 [IO(X)-APIC]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, fast devsel, latency 0 Memory at dfb00000 (32-bit, non-prefetchable) [size=4K] Capabilities: [44] Express Endpoint IRQ 0 Capabilities: [6c] Power Management version 2 02:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=02, secondary=04, subordinate=04, sec-latency=64 I/O behind bridge: 00002000-00002fff Memory behind bridge: dfc00000-dfcfffff Capabilities: [44] Express PCI/PCI-X Bridge IRQ 0 Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Capabilities: [6c] Power Management version 2 Capabilities: [d8] PCI-X bridge device. 
02:00.3 PIC: Intel Corporation 6700PXH I/OxAPIC Interrupt Controller B (rev 09) (prog-if 20 [IO(X)-APIC]) Subsystem: Super Micro Computer Inc: Unknown device 5680 Flags: bus master, fast devsel, latency 0 Memory at dfb01000 (32-bit, non-prefetchable) [size=4K] Capabilities: [44] Express Endpoint IRQ 0 Capabilities: [6c] Power Management version 2 04:02.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04) Subsystem: Intel Corporation PRO/1000 MT Server Adapter Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 185 Memory at dfc00000 (64-bit, non-prefetchable) [size=128K] I/O ports at 2000 [size=64] Capabilities: [dc] Power Management version 2 Capabilities: [e4] PCI-X non-bridge device. Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- -bash-3.00# lspci -vv 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0a) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Step ping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR+ TAbort- Reset- FastB2B- Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot +,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable - Address: fee00000 Data: 0000 Capabilities: [64] Express Root Port (Slot-) IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 128 bytes Link: Supported Speed 2.5Gb/s, Width x0, ASPM L0s, Port 2 Link: Latency L0s <4us, L1 unlimited Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x0 Root: Correctable- Non-Fatal- Fatal- PME- 00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express 
Port A1 (rev 0a) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Step ping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot +,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable - Address: fee00000 Data: 0000 Capabilities: [64] Express Root Port (Slot-) IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 256 bytes, MaxReadReq 512 bytes Link: Supported Speed 2.5Gb/s, Width x4, ASPM L0s, Port 3 Link: Latency L0s <4us, L1 unlimited Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x4 Root: Correctable- Non-Fatal- Fatal- PME- 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0a) (p rog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Step ping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR+ TAbort- Reset- FastB2B- Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot +,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Message Signalled Interrupts: 64bit- Queue=0/1 Enable - Address: fee00000 Data: 0000 Capabilities: [64] Express Root Port (Slot-) IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 128 bytes Link: Supported Speed 2.5Gb/s, Width x0, ASPM L0s, Port 4 Link: Latency L0s <4us, L1 
unlimited Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x0 Root: Correctable- Non-Fatal- Fatal- PME- ............... 02:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 0 9) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Step ping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [44] Express PCI/PCI-X Bridge IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 256 bytes, MaxReadReq 512 bytes Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 0 Link: Latency L0s unlimited, L1 unlimited Link: ASPM Disabled CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x8 Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable - Address: 0000000000000000 Data: 0000 Capabilities: [6c] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot +,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [d8] PCI-X bridge device. Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=2 Status: Bus=2 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, SCO-, SRD- : Upstream: Capacity=65535, Commitment Limit=65535 : Downstream: Capacity=65535, Commitment Limit=65535 ............... 
02:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [44] Express PCI/PCI-X Bridge IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal+ Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 256 bytes, MaxReadReq 512 bytes Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 0 Link: Latency L0s unlimited, L1 unlimited Link: ASPM Disabled CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x8 Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Address: 0000000000000000 Data: 0000 Capabilities: [6c] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [d8] PCI-X bridge device. Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=3 Status: Bus=2 Dev=0 Func=2 64bit- 133MHz- SCD- USC-, SCO-, SRD- : Upstream: Capacity=65535, Commitment Limit=65535 : Downstream: Capacity=65535, Commitment Limit=65535 ................. 
-bash-3.00# lspci -vv | grep hca -bash-3.00# lspci -vv | grep HCA -bash-3.00# lspci -vv | grep MTHCA -bash-3.00# lspci -vv | grep mthca From dledford at redhat.com Tue Feb 14 12:20:02 2006 From: dledford at redhat.com (Doug Ledford) Date: Tue, 14 Feb 2006 15:20:02 -0500 Subject: [openib-general] Re: newbie to openib In-Reply-To: <43F23A4B.6040101@cs.ucsb.edu> References: <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> <20060214091606.GA12170@mellanox.co.il> <43F1A3C8.2070106@cs.ucsb.edu> <43F23A4B.6040101@cs.ucsb.edu> Message-ID: <20060214202002.GD10672@redhat.com> On Tue, Feb 14, 2006 at 12:15:07PM -0800, Lamia M.Youseff wrote: > -bash-3.00# lspci > 00:00.0 Host bridge: Intel Corporation E7525 Memory Controller Hub (rev 0a) > 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A > (rev 0a) > 00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A1 > (rev 0a) > 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev > 0a) > 00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI > Controller #1 (rev 02) > 00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI > Controller #2 (rev 02) > 00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI > Controller #3 (rev 02) > 00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI > Controller #4 (rev 02) > 00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI > Controller (rev 02) > 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) > 00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface > Bridge (rev 02) > 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller > (rev 02) > 00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller > (rev 02) 
> 00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER > (ICH5/ICH5R) AC'97 Audio Controller (rev 02) > 02:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A > (rev 09) > 02:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A > (rev 09) > 02:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B > (rev 09) > 02:00.3 PIC: Intel Corporation 6700PXH I/OxAPIC Interrupt Controller B (rev > 09) > 04:02.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet > Controller (rev 04) You don't have an Infiniband card in the machine (or else the card is broken). If there was a mellanox card in the machine there would be an entry in the lspci output similar to this: 02:06.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) -- Doug Ledford 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From rdreier at cisco.com Tue Feb 14 13:23:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 13:23:49 -0800 Subject: [openib-general] Re: newbie to openib In-Reply-To: <20060214202002.GD10672@redhat.com> (Doug Ledford's message of "Tue, 14 Feb 2006 15:20:02 -0500") References: <20060214074840.GE15063@mellanox.co.il> <43F18F03.7040103@cs.ucsb.edu> <20060214082034.GB10494@mellanox.co.il> <43F1961A.2000100@cs.ucsb.edu> <20060214083909.GC10494@mellanox.co.il> <43F19EB4.50909@cs.ucsb.edu> <20060214091606.GA12170@mellanox.co.il> <43F1A3C8.2070106@cs.ucsb.edu> <43F23A4B.6040101@cs.ucsb.edu> <20060214202002.GD10672@redhat.com> Message-ID: Doug> You don't have an Infiniband card in the machine (or else Doug> the card is broken). If there was a mellanox card in the Doug> machine there would be an entry in the lspci output similar Doug> to this: Exactly. There is no HCA device listed in your lspci output. 
Just for completeness, a PCI Express HCA would look like: 0000:04:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) Until the HCA shows up on your PCI bus, there's nothing you can do. - R. From halr at voltaire.com Tue Feb 14 16:25:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Feb 2006 19:25:44 -0500 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1139945214.2169.4.camel@beast.terraplex.com> References: <1139945214.2169.4.camel@beast.terraplex.com> Message-ID: <1139963140.4333.16065.camel@hal.voltaire.com> Hi Owen, On Tue, 2006-02-14 at 14:26, Owen Stampflee wrote: > [root at m1 ~]# opensm > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > Based on OpenIB svn 5411 > Command Line Arguments: > Log File: /var/log/osm.log > ------------------------------------------------- > *** glibc detected *** realloc(): invalid next size: 0x000000001007ae90 ^^^^^^^^^^^^^^^^^ That's a pretty big size (268,938,896). Any idea what was going on ? How far do you get with opensm ? Is there anything in the log ? Out of curiousity, how big is your subnet ? > *** > Aborted Is this reproducible ? > Installed components are: > * kernel-g5-smp-2.6.15-1.yhpc.1 > * gcc-3.4.4-2.ydl.2 > * glibc-2.3.4-2.13.ydl.0 Is this PowerPC ? > * SBS mthca adapter > > Everything is built 64-bit, with -fPIC. Any idea on what could be going > on? The default build doesn't use -fPIC but I wouldn't think that should have anything to do with it. -- Hal > Thanks, > Owen > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From ostampflee at terrasoftsolutions.com Tue Feb 14 16:51:23 2006 From: ostampflee at terrasoftsolutions.com (Owen Stampflee) Date: Tue, 14 Feb 2006 16:51:23 -0800 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1139963140.4333.16065.camel@hal.voltaire.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> Message-ID: <1139964683.5941.10.camel@beast.terraplex.com> On Tue, 2006-02-14 at 19:25 -0500, Hal Rosenstock wrote: > Hi Owen, > > On Tue, 2006-02-14 at 14:26, Owen Stampflee wrote: > > [root at m1 ~]# opensm > > ------------------------------------------------- > > OpenSM Rev:openib-1.1.0 > > Based on OpenIB svn 5411 > > Command Line Arguments: > > Log File: /var/log/osm.log > > ------------------------------------------------- > > *** glibc detected *** realloc(): invalid next size: 0x000000001007ae90 > ^^^^^^^^^^^^^^^^^ > That's a pretty big size (268,938,896). Any idea what was going on ? > > How far do you get with opensm ? Is there anything in the log ? > > Out of curiousity, how big is your subnet ? Small, two nodes directly connected, I'll see if it works better with a switch in the middle. > > *** > > Aborted > > Is this reproducible ? Yes, everytime opensm runs. Only thing is the log is this: Feb 14 13:18:48 488484 [A9020] -> OpenSM Rev:openib-1.1.0 OpenIB svn 5411 > > Installed components are: > > * kernel-g5-smp-2.6.15-1.yhpc.1 > > * gcc-3.4.4-2.ydl.2 > > * glibc-2.3.4-2.13.ydl.0 > > Is this PowerPC ? Yup, PowerMac G5s. > > * SBS mthca adapter > > > > Everything is built 64-bit, with -fPIC. Any idea on what could be going > > on? > > The default build doesn't use -fPIC but I wouldn't think that should > have anything to do with it. I tried without -fPIC, also rebuild the Fedora FC5 RPMs (svn4265) to see if that helped, but still no luck. I'm using svn5411.
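On the number in the abort message: as far as I can tell, glibc's heap consistency checks print the address of the chunk they were examining after "invalid next size", not a requested allocation size, so the value may well be a pointer into a corrupted heap rather than a 268 MB request. Either way, the decimal figure quoted in this thread is simply the hex value converted, which the minimal sketch below reproduces. The helper name parse_hex_value is made up for illustration and is not part of OpenSM or glibc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not from OpenSM or glibc): convert the hexadecimal
 * value printed in a diagnostic such as
 *   "realloc(): invalid next size: 0x000000001007ae90"
 * into an integer. strtoull() with base 16 accepts the leading "0x". */
unsigned long long parse_hex_value(const char *text)
{
    return strtoull(text, NULL, 16);
}
```

Passing the string "0x000000001007ae90" gives 268938896, matching the figure in the reply above. If the value is an address, a memory checker run of opensm would probably be more informative than the number itself.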
From jgunthorpe at obsidianresearch.com Tue Feb 14 17:02:10 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 14 Feb 2006 18:02:10 -0700 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards Message-ID: <20060215010210.GC31277@obsidianresearch.com> Hi All, Does anyone have any idea if any lower end bare bones motherboards (i.e. the Asus/Gigabyte type vendors) work successfully with a Mellanox PCI-E x8 card (MHGS18-XT A2 specifically)? I'm looking for a cost-effective way to get a bunch of IB hosts for testing, so compute performance isn't a concern. I have a couple desktop systems here that won't even POST if the IB card is installed in the x16 slot so there are definitely some compatibility problems out there. Thanks, Jason From dwatkins at insite.co.nz Tue Feb 14 17:12:10 2006 From: dwatkins at insite.co.nz (Dave Watkins) Date: Wed, 15 Feb 2006 14:12:10 +1300 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards Message-ID: I've tried a few Asus boards with IB cards in their graphics card slots with no luck; the boards did POST, but the cards weren't visible to the drivers. Same cards in other machines were fine. I suspect somewhere in the BIOS or chipset you can actually say that a certain PCI-E x16 slot is for "graphics" and no other cards are initialised, but that's only a guess. Dave > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Jason > Gunthorpe > Sent: Wednesday, 15 February 2006 2:02 p.m. > To: openib-general at openib.org > Subject: [openib-general] Low cost MBs compatible with > Mellanox PCIE Cards > > Hi All, > > Does anyone have any idea if any lower end bare bones > motherboards (i.e. the Asus/Gigabyte type vendors) work > successfully with a Mellanox PCI-E x8 card (MHGS18-XT A2 > specifically)?
> > I'm looking for a cost-effective way to get a bunch of IB > hosts for testing, so compute performance isn't a concern. I > have a couple desktop systems here that won't even POST if > the IB card is installed in the x16 slot so there are > definitely some compatibility problems out there. > > Thanks, > Jason > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue Feb 14 17:15:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 17:15:51 -0800 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <20060215010210.GC31277@obsidianresearch.com> (Jason Gunthorpe's message of "Tue, 14 Feb 2006 18:02:10 -0700") References: <20060215010210.GC31277@obsidianresearch.com> Message-ID: I have a PCIe HCA working in the secondary x16 slot of an ASUS A8N-SLI motherboard, but only as an x1 device. - R.
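Whether a card came up at reduced width, as in the x1 case above, can be read from the "Link: Speed ..., Width xN" lines that lspci -vv prints (the same format as the bridge dumps earlier in this archive). A minimal sketch, assuming that line format; link_width and pcie1_bandwidth_mb are hypothetical helpers, and the 250 MB/s per lane figure is the usual first-generation PCIe number (2.5 Gb/s per lane with 8b/10b encoding):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the lane count from an lspci -vv link line
 * such as "Link: Speed 2.5Gb/s, Width x4", or -1 if the line carries no
 * width. It also matches the "Link: Supported Speed ..., Width xN"
 * capability line, so the caller has to pick the status line. */
int link_width(const char *line)
{
    const char *p = strstr(line, "Width x");
    int width;

    if (p == NULL || sscanf(p, "Width x%d", &width) != 1)
        return -1;
    return width;
}

/* Approximate bandwidth per direction, in MB/s, of a first-generation
 * PCIe link: 2.5 Gb/s per lane minus 8b/10b overhead = 250 MB/s per lane. */
int pcie1_bandwidth_mb(int lanes)
{
    return lanes * 250;
}
```

On these assumptions an x8 HCA that trains at x1 is limited to roughly 250 MB/s per direction instead of roughly 2000 MB/s, which is fine for functional testing but not for performance numbers.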
References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> Message-ID: <43F2AEAE.5010700@yahoo.com.au> Michael S.
Tsirkin wrote: > Index: linux-2.6.16-rc2/include/asm-x86_64/mman.h > =================================================================== > --- linux-2.6.16-rc2.orig/include/asm-x86_64/mman.h 2006-02-14 01:22:27.000000000 +0200 > +++ linux-2.6.16-rc2/include/asm-x86_64/mman.h 2006-02-14 01:24:57.000000000 +0200 > @@ -37,6 +37,8 @@ > #define MADV_WILLNEED 0x3 /* pre-fault pages */ > #define MADV_DONTNEED 0x4 /* discard these pages */ > #define MADV_REMOVE 0x5 /* remove these pages & resources */ > +#define MADV_DONTFORK 0x30 /* dont inherit across fork */ > +#define MADV_DOFORK 0x31 /* do inherit across fork */ > May I ask, what is the rationale for ignoring the apparent conventions of all architectures? For example parisc, you appear to even go contrary to the comment. -- SUSE Labs, Novell Inc. From rdreier at cisco.com Tue Feb 14 22:09:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 22:09:34 -0800 Subject: [openib-general] Re: [PATCH] madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <43F2AEAE.5010700@yahoo.com.au> (Nick Piggin's message of "Wed, 15 Feb 2006 15:31:42 +1100") References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> Message-ID: Nick> May I ask, what is the rationale for ignoring the apparent Nick> conventions of all architectures? For example parisc, you Nick> appear to even go contrary to the comment. Looking through include/asm-*/mman.h, I have to agree. The parisc example seems especially bad, as (in addition to being in the reserved range as Nick notes) the DONTFORK/DOFORK values are stuck in a block with the page size values instead of the previous block where they seem more sensible.
However, in other files like the alpha version, where the rest of the values are in decimal, the hex defines look rather jarring. Michael, what led you to choose 0x30 and 0x31 for the two new values? It does seem that keeping them uniform across architectures is a reasonable thing to do, but as far as I can tell the values 9 and 10 are unused on all architectures, and have the added merit of not falling in the parisc reserved range. Do we still have a chance to change this? - R. From akpm at osdl.org Tue Feb 14 22:16:54 2006 From: akpm at osdl.org (Andrew Morton) Date: Tue, 14 Feb 2006 22:16:54 -0800 Subject: [openib-general] Re: [PATCH] madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> Message-ID: <20060214221654.67288424.akpm@osdl.org> Roland Dreier wrote: > > Do we still have a chance to change this? yes, please do.
From glebn at voltaire.com Tue Feb 14 23:34:52 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 15 Feb 2006 09:34:52 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060214164236.GB12974@mellanox.co.il> References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> Message-ID: <20060215073452.GI24524@minantech.com> On Tue, Feb 14, 2006 at 06:42:36PM +0200, Michael S. Tsirkin wrote: > All, MADV_DONTFORK patch is now part of the -mm tree. > Everyone who's interested in fork support, please test 2.6.16-rc3-mm1 and > publish the results here and on lkml. > Good news! Should call to madvise be the part of reg_mr call? -- Gleb. From nickpiggin at yahoo.com.au Tue Feb 14 23:06:55 2006 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Wed, 15 Feb 2006 18:06:55 +1100 Subject: [openib-general] Re: [PATCH] Fix up MADV_DONTFORK/MADV_DOFORK definitions In-Reply-To: <20060215064836.GG24524@minantech.com> References: <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> <20060214221654.67288424.akpm@osdl.org> <20060215064836.GG24524@minantech.com> Message-ID: <43F2D30F.9050904@yahoo.com.au> Gleb Natapov wrote: > On Tue, Feb 14, 2006 at 10:34:48PM -0800, Roland Dreier wrote: > >> Andrew> yes, please do. >> >>OK, here's a patch that changes them to 9 and 10. I would hold off >>sending this to Linus until Michael has a chance to speak up, in case >>there's a reason I don't know for choosing 0x30 and 0x31. >> > > Here > http://marc.theaimsgroup.com/?l=linux-kernel&m=113162971606408&w=2 > at the end there is a reasoning.
Well it may make userspace portability slightly easier for this one case (exactly how, I'm not so sure because each architecture has its own MADV_ defines anyway). I rather think this should be left up to arch maintainers' numbering schemes, but... > So I think 9 and 10 will do too. s/too// ? 0x30 and 0x31 broke parisc's numbering scheme. > By the way Nick was on CC list back then and hasn't raised any > concerns :) > I probably would have assumed it had gone past arch maintainers and so wouldn't have given it a second thought: I don't know a great deal about the issues here. I just now happened to see the parisc comment. But no harm done this time. -- SUSE Labs, Novell Inc. From mst at mellanox.co.il Wed Feb 15 00:13:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 10:13:25 +0200 Subject: [openib-general] Re: [PATCH] madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> Message-ID: <20060215081325.GC10026@mellanox.co.il> Quoting r. Roland Dreier : > Michael, what led you to choose 0x30 and 0x31 for the two new values? > It does seem that keeping them uniform across architectures is a > reasonable thing to do, but as far as I can tell the values 9 and 10 > are unused on all architectures, and have the added merit of not > falling in the parisc reserved range. No particular reason - I just selected values away from the rest of the pack. Let's go ahead and change them. > Do we still have a chance to change this? So, any value consistent across architectures will do. -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Tue Feb 14 22:34:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Feb 2006 22:34:48 -0800 Subject: [openib-general] [PATCH] Fix up MADV_DONTFORK/MADV_DOFORK definitions In-Reply-To: <20060214221654.67288424.akpm@osdl.org> (Andrew Morton's message of "Tue, 14 Feb 2006 22:16:54 -0800") References: <20060213154114.GO32041@mellanox.co.il> <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> <20060214221654.67288424.akpm@osdl.org> Message-ID: Andrew> yes, please do. OK, here's a patch that changes them to 9 and 10. I would hold off sending this to Linus until Michael has a chance to speak up, in case there's a reason I don't know for choosing 0x30 and 0x31. - R. The recently added MADV_DONTFORK and MADV_DOFORK values were defined to be 0x30 and 0x31 respectively. This leaves a strange gap from the older values, and ends up putting the values in the range of values that parisc reserves for page size specification. Also, the macros were always defined using hex, which looks somewhat strange when an architecture defines all the other values in decimal. Change MADV_DONTFORK and MADV_DOFORK to be 9 and 10 respectively. These values are unused on all architectures and safely outside of the parisc reserved range. Define the values in decimal or hex to match the surrounding style for each architecture. While we're touching all this, change the comments from "dont inherit" to "don't inherit." 
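For anyone testing the flag from userspace, here is a minimal sketch of how an application might mark a buffer so that it is not inherited across fork(). It assumes a kernel and C library that define MADV_DONTFORK in <sys/mman.h>; the helper name alloc_dontfork is hypothetical, and the constant is taken from the header rather than hard-coded, since its numeric value is exactly what this patch changes:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS and madvise() under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not part of any patch in this thread): map `len`
 * bytes of anonymous, page-aligned memory and ask the kernel not to make
 * the mapping available to children created by fork().
 * Returns the mapping, or NULL if mmap() or madvise() fails. */
void *alloc_dontfork(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

#ifdef MADV_DONTFORK
    /* madvise() needs a page-aligned start address; mmap() guarantees that. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
#endif
    return buf;
}
```

Using mmap() rather than malloc() keeps the sketch simple because the advice applies to whole pages; code that registers malloc'ed buffers would have to round the range out to page boundaries itself, and would presumably undo the advice with MADV_DOFORK when the buffer is released.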
Signed-off-by: Roland Dreier diff --git a/include/asm-alpha/mman.h b/include/asm-alpha/mman.h index a21515c..0831a7c 100644 --- a/include/asm-alpha/mman.h +++ b/include/asm-alpha/mman.h @@ -43,8 +43,8 @@ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ #define MADV_REMOVE 7 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 9 /* don't inherit across fork */ +#define MADV_DOFORK 10 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-arm/mman.h b/include/asm-arm/mman.h index 693ed85..9a87604 100644 --- a/include/asm-arm/mman.h +++ b/include/asm-arm/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-arm26/mman.h b/include/asm-arm26/mman.h index 2096c50..83240c8 100644 --- a/include/asm-arm26/mman.h +++ b/include/asm-arm26/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-cris/mman.h b/include/asm-cris/mman.h index deddfb2..536bb02 100644 --- 
a/include/asm-cris/mman.h +++ b/include/asm-cris/mman.h @@ -38,8 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-frv/mman.h b/include/asm-frv/mman.h index d3bca30..7f96e5f 100644 --- a/include/asm-frv/mman.h +++ b/include/asm-frv/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-h8300/mman.h b/include/asm-h8300/mman.h index ac0346f..e03cbd8 100644 --- a/include/asm-h8300/mman.h +++ b/include/asm-h8300/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-i386/mman.h b/include/asm-i386/mman.h index ab2339a..2c740e2 100644 --- a/include/asm-i386/mman.h +++ b/include/asm-i386/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* 
discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-ia64/mman.h b/include/asm-ia64/mman.h index 357ebb7..a4b8dc1 100644 --- a/include/asm-ia64/mman.h +++ b/include/asm-ia64/mman.h @@ -44,8 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-m32r/mman.h b/include/asm-m32r/mman.h index 6b02fe3..68c28c5 100644 --- a/include/asm-m32r/mman.h +++ b/include/asm-m32r/mman.h @@ -38,8 +38,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-m68k/mman.h b/include/asm-m68k/mman.h index efd12bc..dd98f77 100644 --- a/include/asm-m68k/mman.h +++ b/include/asm-m68k/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define 
MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-mips/mman.h b/include/asm-mips/mman.h index 6d01e26..018eb2e 100644 --- a/include/asm-mips/mman.h +++ b/include/asm-mips/mman.h @@ -66,8 +66,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-parisc/mman.h b/include/asm-parisc/mman.h index a381cf5..e231d7a 100644 --- a/include/asm-parisc/mman.h +++ b/include/asm-parisc/mman.h @@ -39,6 +39,8 @@ #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */ #define MADV_VPS_INHERIT 7 /* Inherit parents page size */ #define MADV_REMOVE 8 /* remove these pages & resources */ +#define MADV_DONTFORK 9 /* don't inherit across fork */ +#define MADV_DOFORK 10 /* do inherit across fork */ /* The range 12-64 is reserved for page size specification. 
*/ #define MADV_4K_PAGES 12 /* Use 4K pages */ @@ -49,8 +51,6 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-powerpc/mman.h b/include/asm-powerpc/mman.h index fcff25d..e81aa80 100644 --- a/include/asm-powerpc/mman.h +++ b/include/asm-powerpc/mman.h @@ -45,8 +45,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-s390/mman.h b/include/asm-s390/mman.h index d41ca14..d9a5387 100644 --- a/include/asm-s390/mman.h +++ b/include/asm-s390/mman.h @@ -44,8 +44,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-sh/mman.h b/include/asm-sh/mman.h index 0e08d05..0e3efb1 100644 --- a/include/asm-sh/mman.h +++ b/include/asm-sh/mman.h @@ -36,8 +36,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont 
inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-sparc/mman.h b/include/asm-sparc/mman.h index 4a298b2..d1618d8 100644 --- a/include/asm-sparc/mman.h +++ b/include/asm-sparc/mman.h @@ -55,8 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-sparc64/mman.h b/include/asm-sparc64/mman.h index d705ec9..cabec87 100644 --- a/include/asm-sparc64/mman.h +++ b/include/asm-sparc64/mman.h @@ -55,8 +55,8 @@ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ #define MADV_REMOVE 0x6 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-v850/mman.h b/include/asm-v850/mman.h index 7b851c3..d652235 100644 --- a/include/asm-v850/mman.h +++ b/include/asm-v850/mman.h @@ -33,8 +33,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* 
don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-x86_64/mman.h b/include/asm-x86_64/mman.h index b699a38..9333709 100644 --- a/include/asm-x86_64/mman.h +++ b/include/asm-x86_64/mman.h @@ -37,8 +37,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS diff --git a/include/asm-xtensa/mman.h b/include/asm-xtensa/mman.h index e2d7afb..12f93a7 100644 --- a/include/asm-xtensa/mman.h +++ b/include/asm-xtensa/mman.h @@ -73,8 +73,8 @@ #define MADV_WILLNEED 0x3 /* pre-fault pages */ #define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_DONTFORK 0x9 /* don't inherit across fork */ +#define MADV_DOFORK 0xa /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS From gleb at minantech.com Tue Feb 14 22:48:36 2006 From: gleb at minantech.com (Gleb Natapov) Date: Wed, 15 Feb 2006 08:48:36 +0200 Subject: [openib-general] Re: [PATCH] Fix up MADV_DONTFORK/MADV_DOFORK definitions In-Reply-To: References: <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> <20060214221654.67288424.akpm@osdl.org> Message-ID: <20060215064836.GG24524@minantech.com> On Tue, Feb 14, 2006 at 10:34:48PM -0800, Roland Dreier wrote: > Andrew> yes, please do. 
>
> OK, here's a patch that changes them to 9 and 10. I would hold off
> sending this to Linus until Michael has a chance to speak up, in case
> there's a reason I don't know for choosing 0x30 and 0x31.
>
The reasoning is given at the end of this message:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113162971606408&w=2
So I think 9 and 10 will do too. By the way, Nick was on the CC list back
then and hasn't raised any concerns :)

--
Gleb.

From mst at mellanox.co.il Wed Feb 15 00:18:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Feb 2006 10:18:57 +0200
Subject: [openib-general] Re: [PATCH] Fix up MADV_DONTFORK/MADV_DOFORK definitions
In-Reply-To:
References: <20060213210906.GC13603@mellanox.co.il> <20060213225538.GE13603@mellanox.co.il> <20060213233517.GG13603@mellanox.co.il> <43F2AEAE.5010700@yahoo.com.au> <20060214221654.67288424.akpm@osdl.org>
Message-ID: <20060215081857.GD10026@mellanox.co.il>

Please apply.

Quoting Roland Dreier :
> Change MADV_DONTFORK and MADV_DOFORK to be 9 and 10 respectively.
> These values are unused on all architectures and safely outside of the
> parisc reserved range. Define the values in decimal or hex to match
> the surrounding style for each architecture. While we're touching all
> this, change the comments from "dont inherit" to "don't inherit."
>
> Signed-off-by: Roland Dreier

Acked-by: Michael S. Tsirkin

Thanks,

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Wed Feb 15 00:23:31 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Feb 2006 10:23:31 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215073452.GI24524@minantech.com>
References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com>
Message-ID: <20060215082331.GE10026@mellanox.co.il>

Quoting r.
Gleb Natapov :
> Subject: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
>
> On Tue, Feb 14, 2006 at 06:42:36PM +0200, Michael S. Tsirkin wrote:
> > All, MADV_DONTFORK patch is now part of the -mm tree.
> > Everyone who's interested in fork support, please test 2.6.16-rc3-mm1 and
> > publish the results here and on lkml.
> >
> Good news!
>
> Should call to madvise be the part of reg_mr call?

Probably no - MPI should have to do it.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From glebn at voltaire.com Wed Feb 15 00:31:16 2006
From: glebn at voltaire.com (Gleb Natapov)
Date: Wed, 15 Feb 2006 10:31:16 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215082331.GE10026@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il>
Message-ID: <20060215083115.GJ24524@minantech.com>

On Wed, Feb 15, 2006 at 10:23:31AM +0200, Michael S. Tsirkin wrote:
> Quoting r. Gleb Natapov :
> > Subject: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
> >
> > On Tue, Feb 14, 2006 at 06:42:36PM +0200, Michael S. Tsirkin wrote:
> > > All, MADV_DONTFORK patch is now part of the -mm tree.
> > > Everyone who's interested in fork support, please test 2.6.16-rc3-mm1 and
> > > publish the results here and on lkml.
> > >
> > Good news!
> >
> > Should call to madvise be the part of reg_mr call?
>
> Probably no - MPI should have to do it.
>
Then each userspace app will have to reinvent the wheel. Remember that we
should gracefully handle overlapping registrations.

--
Gleb.

From mst at mellanox.co.il Wed Feb 15 01:02:50 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 15 Feb 2006 11:02:50 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215083115.GJ24524@minantech.com>
References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com>
Message-ID: <20060215090250.GF12974@mellanox.co.il>

Clarification: as I see it, longer term we want to add a flag to make
get_user_pages trigger an immediate page copy on fork (rather than copy_ptes).
In this setup, MADV_DONTFORK will be used to speed up fork for an application
that has locked a big portion of its address space. With this in mind:

Quoting r. Gleb Natapov :
> > > Should call to madvise be the part of reg_mr call?
> >
> > Probably no - MPI should have to do it.

uDAPL as well, I guess.

> Then each userspace app will have to reinvent the wheel.

I thought applications used MPI?

> Remember that we should gracefully handle overlapping registrations.

Right, and madvise doesn't do any refcounting. That's one reason not to include it
in reg_mr. Another is that madvise only works for full pages.

Applications should be aware of these limitations, and I think the easiest way
to achieve this is by asking them to use madvise directly.

--
Michael S.
Tsirkin
Staff Engineer, Mellanox Technologies

From glebn at voltaire.com Wed Feb 15 01:30:07 2006
From: glebn at voltaire.com (Gleb Natapov)
Date: Wed, 15 Feb 2006 11:30:07 +0200
Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215090250.GF12974@mellanox.co.il>
References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il>
Message-ID: <20060215093007.GK24524@minantech.com>

On Wed, Feb 15, 2006 at 11:02:50AM +0200, Michael S. Tsirkin wrote:
> Clarification: as I see it, longer term we want to add a flag to make
> get_user_pages trigger an immediate page copy on fork (rather than copy_ptes).
Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)?

> In this setup, MADV_DONTFORK will be used to speed up fork for an application
> that has locked a big portion of its address space. With this in mind:
>
> Quoting r. Gleb Natapov :
> > > > Should call to madvise be the part of reg_mr call?
> > >
> > > Probably no - MPI should have to do it.
> uDAPL as well, I guess.
>
> > Then each userspace app will have to reinvent the wheel.
> I thought applications used MPI?
I hope you don't think that InfiniBand is good only for HPC :)
More and more organisations want to develop applications directly for
InfiniBand without a middle layer. Not all of them want to understand deep VM
magic to do so.

>
> > Remember that we should gracefully handle overlapping registrations.
> Right, and madvise doesn't do any refcounting. That's one reason not to include it
> in reg_mr.
I beg to differ. I think this is exactly the reason to include it in reg_mr.
Otherwise each application has to reinvent the refcounting logic. It is much
better to do it right once instead of doing it wrong many times.
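
The refcounting at issue here can be sketched roughly as follows. This is a
minimal illustration only, assuming a small fixed-size table of per-page
counts. The names dontfork_get, dontfork_put, dontfork_count and MAX_TRACKED
are hypothetical; they are not part of libibverbs or of any patch in this
thread. Only madvise(2), MADV_DONTFORK and MADV_DOFORK are real interfaces.
A page is marked MADV_DONTFORK when its count goes from 0 to 1 and is handed
back with MADV_DOFORK when the count returns to 0, so overlapping
registrations do not undo each other.

```c
#define _DEFAULT_SOURCE          /* expose MADV_DONTFORK/MADV_DOFORK in glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TRACKED 1024         /* hypothetical limit on tracked pages */

struct page_ref {
    uintptr_t page;              /* page-aligned address */
    unsigned  count;             /* 0 means the slot is free */
};

static struct page_ref refs[MAX_TRACKED];

/* Find the slot tracking a page; optionally claim a free slot for it. */
static struct page_ref *find_slot(uintptr_t page, int create)
{
    struct page_ref *free_slot = NULL;

    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (refs[i].count != 0 && refs[i].page == page)
            return &refs[i];
        if (refs[i].count == 0 && free_slot == NULL)
            free_slot = &refs[i];
    }
    if (!create || free_slot == NULL)
        return NULL;
    free_slot->page = page;
    return free_slot;
}

/* Round [addr, addr+len) out to whole pages: madvise works on full pages. */
static void page_span(const void *addr, size_t len,
                      uintptr_t *start, uintptr_t *end, uintptr_t *psz)
{
    *psz = (uintptr_t)sysconf(_SC_PAGESIZE);
    *start = (uintptr_t)addr & ~(*psz - 1);
    *end = ((uintptr_t)addr + len + *psz - 1) & ~(*psz - 1);
}

/* Take one reference on every page of the range. Returns 0 or -1.
 * Sketch only: a real implementation would roll back on partial failure. */
int dontfork_get(void *addr, size_t len)
{
    uintptr_t start, end, psz;

    assert(len > 0);
    page_span(addr, len, &start, &end, &psz);
    for (uintptr_t p = start; p < end; p += psz) {
        struct page_ref *r = find_slot(p, 1);

        if (r == NULL)
            return -1;                       /* table full */
        if (r->count == 0 &&
            madvise((void *)p, psz, MADV_DONTFORK) != 0)
            return -1;                       /* first user marks the page */
        r->count++;
    }
    return 0;
}

/* Drop one reference on every page of the range. Returns 0 or -1. */
int dontfork_put(void *addr, size_t len)
{
    uintptr_t start, end, psz;

    assert(len > 0);
    page_span(addr, len, &start, &end, &psz);
    for (uintptr_t p = start; p < end; p += psz) {
        struct page_ref *r = find_slot(p, 0);

        if (r == NULL)
            return -1;                       /* page was never registered */
        if (--r->count == 0 &&
            madvise((void *)p, psz, MADV_DOFORK) != 0)
            return -1;                       /* last user restores the page */
    }
    return 0;
}

/* Current reference count of the page containing addr (0 if untracked). */
unsigned dontfork_count(void *addr)
{
    uintptr_t start, end, psz;
    struct page_ref *r;

    page_span(addr, 1, &start, &end, &psz);
    r = find_slot(start, 0);
    return r ? r->count : 0;
}
```

Calling dontfork_get() on registration and dontfork_put() on deregistration
would give two overlapping regions the behaviour asked for above: the shared
page stays MADV_DONTFORK until the second region is released. Whether such
bookkeeping belongs in reg_mr or in each application is the open question in
this thread.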
> Another is that madvise only works for full pages.
Everything in the VM works only on full pages. Unix doesn't try to hide this
from the user.

>
> Applications should be aware of these limitations, and I think the easiest way
> to achieve this is by asking them to use madvise directly.
The problem is not in madvise but in the refcounting that each application
must maintain.

--
Gleb.

From yael at mellanox.co.il Wed Feb 15 02:07:27 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 15 Feb 2006 12:07:27 +0200
Subject: [openib-general] [PATCH] Opensm - osmt_service.c fixes
Message-ID: <5zu0b1j5pc.fsf@mtl066.yok.mtl.com>

Hi Hal,

The following patch fixes some problems in osmt_service.c flow, along
with cosmetic cleanups of the code.

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: osmtest/osmt_service.c
===================================================================
--- osmtest/osmt_service.c (revision 5403)
+++ osmtest/osmt_service.c (working copy)
@@ -80,7 +80,7 @@ osmt_register_service( IN osmtest_t * co osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -140,8 +140,8 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A01: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -150,14 +150,14 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A02: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service: " -
"Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -195,7 +195,7 @@ osmt_register_service_with_full_key ( IN osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_full_key: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -256,8 +256,8 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A03: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -280,7 +280,7 @@ osmt_register_service_with_full_key ( IN status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key:" - "Data mismatch in service_key.\n" + "Data mismatch in service_key\n" ); goto Exit; } @@ -288,14 +288,14 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A04: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -337,7 +337,7 @@ osmt_register_service_with_data( IN osmt osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_data: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -426,8 +426,8 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A05: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -436,14 +436,14 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A06: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -464,7 +464,7 @@ osmt_register_service_with_data( IN osmt status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Data mismatch in service_data8.\n" + "Data mismatch in service_data8\n" ); goto Exit; } @@ -499,7 +499,7 @@ osmt_get_service_by_id_and_name ( IN osm if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Getting Service Record by id 0x%016" PRIx64 " and name : %s\n", + "Getting service record: id:0x%016" PRIx64 " and name:%s\n", cl_ntoh64(sid),sr_name); /* @@ -509,26 +509,29 @@ osmt_get_service_by_id_and_name ( IN osm * * The query structures are locals. 
*/ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SNAME; @@ -539,62 +542,73 @@ osmt_get_service_by_id_and_name ( IN osm if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id_and_name: ERR 4A07: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 
0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id_and_name: ERR 4A08: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Unmatched record number, Expeceted : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -623,7 +637,7 @@ osmt_get_service_by_id ( IN osmtest_t * if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE )
) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Getting Service Record by id 0x%016" PRIx64 "\n", + "Getting service record: id:0x%016" PRIx64 "\n", cl_ntoh64(sid)); /* @@ -633,23 +647,26 @@ osmt_get_service_by_id ( IN osmtest_t * * * The query structures are locals. */ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID; @@ -660,62 +677,74 @@ osmt_get_service_by_id ( IN osmtest_t * if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id: ERR 4A09: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. 
*/ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id: ERR 4A0A: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: " - "Unmatched record number; Expected : %d, Got : %d.\n", + "osmt_get_service_by_id: ERR 4A0B: " + "Unmatched number of records: expected:%d received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -737,22 +766,25 @@ osmt_get_service_by_name_and_key ( IN os osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t 
svc_rec,*p_rec; - uint32_t num_recs = 0; + uint32_t num_recs = 0, i; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name_and_key ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) { - uint8_t i; + char buf_service_key[33]; + + sprintf(buf_service_key, + "0x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x", + skey[0], skey[1], skey[2], skey[3], skey[4], skey[5], skey[6], skey[7], + skey[8], skey[9], skey[10], skey[11], skey[12], skey[13], skey[14], + skey[15]); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Getting Service Record by Name and Key :%s.\n", - sr_name); - for (i=0 ; i<=15 ; i++) - osm_log( &p_osmt->log, OSM_LOG_VERBOSE, - "Service Key[%u] = %u\n", - i,skey[i]); + "Getting service record: name:%s and key:%s\n", + sr_name, buf_service_key ); } /* @@ -762,92 +794,108 @@ osmt_get_service_by_name_and_key ( IN os * * The query structures are locals. */ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - /* prepare the data used for this query */ context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + for (i=0 ; i<=15 ; i++) + 
svc_rec.service_key[i] = skey[i]; + + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SNAME | IB_SR_COMPMASK_SKEY; user.attr_offset = ib_get_attr_offset( sizeof( ib_service_record_t ) ); user.p_attr = &svc_rec; - status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name_and_key: ERR 4A0C: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. 
- p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name_and_key: ERR 4A0D: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Unmatched record number, Expected : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if ( num_recs ) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -874,12 +922,10 @@ osmt_get_service_by_name( IN osmtest_t * OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) - { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Getting Service Record by Name:%s.\n", + "Getting service record: name:%s\n", sr_name); - } /* * Do a blocking query for this record in the subnet. 
@@ -893,78 +939,90 @@ osmt_get_service_by_name( IN osmtest_t * context.p_osmt = p_osmt; + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_SVC_REC_BY_NAME; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; + req.sm_key = 0; + cl_memclr(service_name, sizeof(service_name)); cl_memcpy(service_name, sr_name, (strlen(sr_name)+1)*sizeof(char)); req.p_query_input = service_name; - req.sm_key = 0; status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name: ERR 4A0E: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - /* The context struct is not init OR result with illegal number of records */ - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. 
- p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name: ERR 4A0F: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: " - "Unmatched record number, Expeceted : %d, Got : %u.\n", + "osmt_get_service_by_name: ERR 4A10: " + "Unmatched number of records: expected:%d, received:%u\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Expected num of records is : %d, Found number of records : %u\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); if( context.result.p_result_madw != NULL ) { @@ -1002,7 +1060,7 @@ osmt_get_all_services_and_check_names( I { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Getting All Service Records\n"); + "Getting all service records\n"); } /* * Do a blocking query for this record in the subnet. 
@@ -1028,8 +1086,8 @@ osmt_get_all_services_and_check_names( I if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0371: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A12: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -1040,14 +1098,14 @@ osmt_get_all_services_and_check_names( I if (status != IB_INVALID_PARAMETER) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0372: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A13: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); } if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_all_services_and_check_names: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -1058,14 +1116,14 @@ osmt_get_all_services_and_check_names( I num_recs = context.result.result_cnt; osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Received %u records.\n", num_recs ); + "Received %u records\n", num_recs ); for( i = 0; i < num_recs; i++ ) { p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, i ); osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_VERBOSE); for ( j = 0; j < num_of_valid_names; j++) @@ -1091,8 +1149,8 @@ osmt_get_all_services_and_check_names( I if (p_checked_names[j] == 0) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0377: " - "Missing Valid Service Name:%s\n",p_valid_service_names_arr[j]); + "osmt_get_all_services_and_check_names: ERR 
4A14: " + "Missing valid service: name:%s\n",p_valid_service_names_arr[j]); status = IB_ERROR; goto Exit; } @@ -1124,7 +1182,7 @@ osmt_delete_service_by_name(IN osmtest_t osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_delete_service_by_name: " - "Trying to Delete Service: Name:%s.\n", + "Trying to Delete service name:%s\n", sr_name); cl_memclr( &svc_rec, sizeof( svc_rec ) ); @@ -1132,17 +1190,10 @@ osmt_delete_service_by_name(IN osmtest_t status = osmt_get_service_by_name(p_osmt, sr_name,rec_num, &svc_rec); if (status != IB_SUCCESS) { - if (IsServiceExist) osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 001 " - "Nothing to delete - failed to find service by name: %s \n", sr_name); - else - { - osm_log( &p_osmt->log, OSM_LOG_INFO, - "osmt_delete_service_by_name: " - "Record should not exist, i.e. BAD flow\n"); - status = IB_SUCCESS; - } + "osmt_delete_service_by_name: ERR 4A15: " + "Failed to get service: name:%s\n", + sr_name ); goto ExitNoDel; } @@ -1175,29 +1226,57 @@ osmt_delete_service_by_name(IN osmtest_t if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0373: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A16: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; - + if ( IsServiceExist ) + { + /* If IsServiceExist = 1 then we should succeed here */ if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0374: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A17: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: " - "Remote error = %s.\n", + "osmt_delete_service_by_name: ERR 4A18: " + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); } - goto Exit; + } + } + else + { + /* If IsServiceExist = 0 then we should fail here */ + if ( status == IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_delete_service_by_name: ERR 4A19: " + "Succeeded to delete service:%s which " + "shouldn't exist", + sr_name ); + status = IB_ERROR; + } + else + { + /* The deletion should have failed, since the service_name + shouldn't exist. */ + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: " + "IS EXPECTED ERROR ^^^^\n"); + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_delete_service_by_name: " + "Failed to delete service_name:%s\n", + sr_name ); + status = IB_SUCCESS; + } } Exit: @@ -1362,144 +1441,256 @@ osmt_run_service_records_flow( IN osmtes /* Let OpenSM handle it */ usleep(100); + /* Make sure service_name[0] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[0],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1A: " + "Fail to find service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[1] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[1],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1B: " + "Fail to find service: name:%s\n", + (char*)service_name[1] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1C: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - /* Try to get osmt.srvc.4 b4 (there should be 1 record) and after 10 sec - It should be deleted */ + /* Make sure service_name[3] exists. 
*/ + /* After 10 seconds the service should not exist: service_lease = 10 */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1D: " + "Fail to find service: name:%s\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } + sleep(10); + status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1E: " + "Found service: name:%s that should have been " + "deleted due to service lease expiring\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Check that for the current Service ID only one record exists */ + + /* Check that for service: id[5] only one record exists */ status = osmt_get_service_by_id(p_osmt, 1, cl_ntoh64(id[5]),&srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1F: " + "Found number of records!=1 for " + "service: id:0x%016" PRIx64 "\n", + id[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with invalid Service ID */ + /* Bad Flow of Get with invalid Service ID: id[7] */ status = osmt_get_service_by_id(p_osmt, 0,cl_ntoh64(id[7]),&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A20: " + "Found service: id:0x%016 " PRIx64 + "that is invalid\n", + id[7] ); + status = IB_ERROR; goto Exit; } - /* Check that for correct name and ID we get record set b4 */ + + /* Check by both id and service name: id[0], service_name[0] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[0]), (char*)service_name[0], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A21: " + "Fail to find service: 
id:0x%016 " PRIx64 + "name:%s\n", + id[0], + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + + /* Check by both id and service name: id[5], service_name[6] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[5]), (char*)service_name[6], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A22: " + "Fail to find service: id:0x%016 " PRIx64 + "name:%s\n", + id[5], + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with valid name and invalid ID */ + /* Bad Flow of Get with invalid name(service_name[3]) and valid ID(id[0]) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[0]), (char*)service_name[3], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A23: " + "Found service: id:0x%016" PRIx64 + "name:%s which is an invalid service\n", + id[0], + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with unmatched name (exists but not with the following ID) and valid ID */ + + /* Bad Flow of Get with unmatched name(service_name[5]) and id(id[3]) (both valid) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[3]), (char*)service_name[5], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A24: " + "Found service: id:0x%016" PRIx64 + "name:%s which is an invalid service\n", + id[3], + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } + /* Bad Flow of Get with service name that doesn't exist (service_name[4]) */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[4],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A25: " + "Found service: name:%s that shouldn't exist\n", + 
(char*)service_name[4] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow : Check that getting osmt.srvc.6 brings no records since another service has been updated with the same ID - osmt.srvc.7 */ + + /* Bad Flow : Check that getting service_name[5] brings no records since another service + has been updated with the same ID (service_name[6] */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[5],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A26: " + "Found service: name:%s which is an " + "invalid service\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Check that getting osmt.srvc.7 by name ONLY is valid since we do not support key&name association, also trusted queries */ + + /* Check that getting service_name[6] by name ONLY is valid, + since we do not support key&name association, also trusted queries */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[6],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A27: " + "Fail to find service: name:%s\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } /* Test Service Key */ cl_memclr(service_key,16*sizeof(uint8_t)); + + /* Check for service_name[5] with service_key=0 - the service shouldn't + exist with this name. */ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[5], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A28: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong name)\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + + /* Check for service_name[6] with service_key=0 - the service should + exist with different key. 
*/ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A29: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong service_key)\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + /* check for service_name[6] with the correct service_key */ for (i=0;i <= 15;i++) service_key[i]=i + 1; status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], - 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + 1, service_key, &srv_rec); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2A: " + "Fail to find service: name:%s with " + "correct service key\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } #ifdef VENDOR_RMPP_SUPPORT + /* These ar the only service_names which are valid */ cl_memcpy(&service_valid_names[0],&service_name[0],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[1],&service_name[2],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[2],&service_name[6],sizeof(uint8_t)*64); @@ -1507,79 +1698,101 @@ osmt_run_service_records_flow( IN osmtes status = osmt_get_all_services_and_check_names(p_osmt,service_valid_names,3, &num_recs); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2B: " + "Fail to find all services that should exist\n" ); + status = IB_ERROR; goto Exit; } #endif + /* Delete service_name[0] */ status = osmt_delete_service_by_name(p_osmt,1, (char*)service_name[0],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2C: " + "Fail to delete service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure deletion of 
service_name[0] succeeded */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[0],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[0],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.1 \n"); + "osmt_run_service_records_flow: ERR 4A2D: " + "Found service: name:%s that was deleted\n", + (char*)service_name[0] ); status = IB_ERROR; goto Exit; } - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - - sleep(3); + /* Make sure service_name[1] doesn't exist (expired service lease) */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[1],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[1],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.2 \n"); + "osmt_run_service_records_flow: ERR 4A2E: " + "Found service: name:%s that should have expired\n", + (char*)service_name[1] ); status = IB_ERROR; goto Exit; } - else - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - status = IB_SUCCESS; - } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected to find service osmt.srvc.3\n" ); + "osmt_run_service_records_flow: ERR 4A2F: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[6],1); + + /* Bad Flow - try to delete non-existent service_name[5] */ + status = osmt_delete_service_by_name(p_osmt,0, + (char*)service_name[5],0); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + 
"osmt_run_service_records_flow: ERR 4A30: " + "Succeed to delete non-existent service: name:%s\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow */ - status = osmt_delete_service_by_name(p_osmt,0, - (char*)service_name[5],1); - if (status == IB_SUCCESS) + /* Delete service_name[2] */ + status = osmt_delete_service_by_name(p_osmt,1, + (char*)service_name[2],1); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A31: " + "Fail to delete service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } + /* Delete service_name[6] */ status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[2],1); + (char*)service_name[6],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A32: " + "Failed to delete service name:%s\n", + (char*)service_name[6] ); goto Exit; } Index: opensm/osm_sa_service_record.c =================================================================== --- opensm/osm_sa_service_record.c (revision 5403) +++ opensm/osm_sa_service_record.c (working copy) @@ -798,6 +798,13 @@ osm_sr_rcv_process_get_method( p_recvd_service_rec = (ib_service_record_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_dump_service_record( p_rcv->p_log, + p_recvd_service_rec, + OSM_LOG_DEBUG ); + } + cl_qlist_init(&sr_match_item.sr_list); sr_match_item.p_service_rec = p_recvd_service_rec; sr_match_item.comp_mask = p_sa_mad->comp_mask; From mst at mellanox.co.il Wed Feb 15 02:14:48 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Feb 2006 12:14:48 +0200 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215093007.GK24524@minantech.com> References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> Message-ID: <20060215101448.GJ12974@mellanox.co.il> Quoting r. Gleb Natapov : > > Clarification: as I see it, longer term we want to add a flag to make > > get_user_pages trigger an immediate page copy on fork (rather than > > copy_ptes). > > Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)? This should hopefully solve more than just the reg_mr issue, and is not specific to InfiniBand. See e.g. here: http://lkml.org/lkml/2005/12/12/30 So no, this will have to be a per-page flag: set by get_user_pages when passed some new option, and cleared by put_page when the page ref count drops to page map count. BTW, I don't know when I will get around to working on it, so any help would be appreciated. > > In this setup, MADV_DONTFORK will be used to speed up fork for an > > application that has locked a big portion of its address space. With this in > > mind: > > > > Quoting r. Gleb Natapov : > > > > > Should call to madvise be the part of reg_mr call? > > > > > > > > Probably no - MPI should have to do it. > > > > uDAPL as well, I guess. > > > > > Then each userspace app will have to reinvent the wheel. > > > > I thought applications used MPI? > > I hope you don't think that infiniband is good only for HPC :) More and more > organisations want to develop applications directly for infiniband without > middle layer. Not all of them want to understand deep VM magic to do so. See my comment above. 
Once pages locked by get_user_pages are copied on fork, madvise becomes an optimization to speed up fork. So life as usual: you need to get linux-specific to get some speedup. > > > Remember that we should gracefully handle overlapping registrations. > > > Right, and madvise doesn't do any refcounting. That's one reason not to > > include it in reg_mr. > > I beg to differ. I think this is exactly the reason to include it in > reg_mr. Otherwise each application should reinvent refcounting logic. It > is much better to do it right once instead of doing it wrong many times. Talking about applications developed directly for infiniband again? But why do you think they always use overlapping regions? > > Another is that madvise only works for full pages. > > Everything in VM works only for full pages. Unix doesn't try to hide this > from user. ibv_reg_mr works fine for sub-page regions. Doesn't it? > > Applications should be aware of these limitations, and I think the easiest > > way to achieve this is by asking them to use madvise directly. > > The problem is not in madvise but in the refcounting that each application must > maintain. I don't really see a good way around this. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From tziporet at mellanox.co.il Wed Feb 15 02:15:40 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 15 Feb 2006 12:15:40 +0200 Subject: [openib-general] Which driver to choose, OpenIB or IBGD? In-Reply-To: <7b2fa1820602140654h1eeee2bcod1e5f8c15f468443@mail.gmail.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3015CDA95@mtlexch01.mtl.com> <7b2fa1820602140654h1eeee2bcod1e5f8c15f468443@mail.gmail.com> Message-ID: <43F2FF4C.7000009@mellanox.co.il> Ian Jiang wrote: > > Did anybody get an experiment result of comparison? User-level performance tests show that gen2 latency is ~0.3 usec better than gen1 (depending on your CPU MHz). Bandwidth is the same. > Do you know which version of the OpenIB driver IBG2 is based on? 
svn rev: 5166 > Another question: Does it matter which driver is used for the > communication between IB cards or switches? Or could an IB card driven > by IBGD/IBG2 communicate correctly with one driven by OpenIB? Between the HCA and the switch is the IB wire protocol, so SW should not influence this. In general I would recommend developing new projects on gen2. Tziporet From yael at mellanox.co.il Wed Feb 15 02:32:28 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 15 Feb 2006 12:32:28 +0200 Subject: [openib-general] RE: [PATCH] Opensm - osmt_service.c fixes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDAB@mtlexch01.mtl.com> Hi Hal, I am re-sending the patch with another fix. Please ignore this patch. Thanks, Yael -----Original Message----- From: Yael Kalka Sent: Wednesday, February 15, 2006 12:07 PM To: halr at voltaire.com Cc: openib-general at openib.org; Eitan Zahavi; Yael Kalka Subject: [PATCH] Opensm - osmt_service.c fixes Hi Hal, The following patch fixes some problems in the osmt_service.c flow, along with cosmetic cleanups of the code. 
Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmt_service.c =================================================================== --- osmtest/osmt_service.c (revision 5403) +++ osmtest/osmt_service.c (working copy) @@ -80,7 +80,7 @@ osmt_register_service( IN osmtest_t * co osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -140,8 +140,8 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A01: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -150,14 +150,14 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A02: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -195,7 +195,7 @@ osmt_register_service_with_full_key ( IN osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_full_key: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -256,8 +256,8 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A03: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -280,7 +280,7 @@ osmt_register_service_with_full_key ( IN status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key:" - "Data mismatch in service_key.\n" + "Data mismatch in service_key\n" ); goto Exit; } @@ -288,14 +288,14 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A04: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -337,7 +337,7 @@ osmt_register_service_with_data( IN osmt osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_data: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -426,8 +426,8 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A05: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -436,14 +436,14 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A06: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -464,7 +464,7 @@ osmt_register_service_with_data( IN osmt status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Data mismatch in service_data8.\n" + "Data mismatch in service_data8\n" ); goto Exit; } @@ -499,7 +499,7 @@ osmt_get_service_by_id_and_name ( IN osm if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Getting Service Record by id 0x%016" PRIx64 " and name : %s\n", + "Getting service record: id:0x%016" PRIx64 " and name:%s\n", cl_ntoh64(sid),sr_name); /* @@ -509,26 +509,29 @@ osmt_get_service_by_id_and_name ( IN osm * * The query structures are locals. 
*/ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SNAME; @@ -539,62 +542,73 @@ osmt_get_service_by_id_and_name ( IN osm if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id_and_name: ERR 4A07: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 
0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id_and_name: ERR 4A08: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Unmatched record number, Expeceted : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -623,7 +637,7 @@ osmt_get_service_by_id ( IN osmtest_t * if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE )
) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Getting Service Record by id 0x%016" PRIx64 "\n", + "Getting service record: id:0x%016" PRIx64 "\n", cl_ntoh64(sid)); /* @@ -633,23 +647,26 @@ osmt_get_service_by_id ( IN osmtest_t * * * The query structures are locals. */ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID; @@ -660,62 +677,74 @@ osmt_get_service_by_id ( IN osmtest_t * if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id: ERR 4A09: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. 
*/ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id: ERR 4A0A: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: " - "Unmatched record number; Expected : %d, Got : %d.\n", + "osmt_get_service_by_id: ERR 4A0B: " + "Unmatched number of records: expected:%d received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -737,22 +766,25 @@ osmt_get_service_by_name_and_key ( IN os osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t 
svc_rec,*p_rec; - uint32_t num_recs = 0; + uint32_t num_recs = 0, i; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name_and_key ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) { - uint8_t i; + char buf_service_key[35]; + + sprintf(buf_service_key, + "0x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x", + skey[0], skey[1], skey[2], skey[3], skey[4], skey[5], skey[6], skey[7], + skey[8], skey[9], skey[10], skey[11], skey[12], skey[13], skey[14], + skey[15]); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Getting Service Record by Name and Key :%s.\n", - sr_name); - for (i=0 ; i<=15 ; i++) - osm_log( &p_osmt->log, OSM_LOG_VERBOSE, - "Service Key[%u] = %u\n", - i,skey[i]); + "Getting service record: name:%s and key:%s\n", + sr_name, buf_service_key ); } /* @@ -762,92 +794,108 @@ osmt_get_service_by_name_and_key ( IN os * * The query structures are locals. */ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - /* prepare the data used for this query */ context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + for (i=0 ; i<=15 ; i++) +
svc_rec.service_key[i] = skey[i]; + + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SNAME | IB_SR_COMPMASK_SKEY; user.attr_offset = ib_get_attr_offset( sizeof( ib_service_record_t ) ); user.p_attr = &svc_rec; - status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name_and_key: ERR 4A0C: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. 
- p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name_and_key: ERR 4A0D: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Unmatched record number, Expected : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if ( num_recs ) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -874,12 +922,10 @@ osmt_get_service_by_name( IN osmtest_t * OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) - { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Getting Service Record by Name:%s.\n", + "Getting service record: name:%s\n", sr_name); - } /* * Do a blocking query for this record in the subnet. 
@@ -893,78 +939,90 @@ osmt_get_service_by_name( IN osmtest_t * context.p_osmt = p_osmt; + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_SVC_REC_BY_NAME; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; + req.sm_key = 0; + cl_memclr(service_name, sizeof(service_name)); cl_memcpy(service_name, sr_name, (strlen(sr_name)+1)*sizeof(char)); req.p_query_input = service_name; - req.sm_key = 0; status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name: ERR 4A0E: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - /* The context struct is not init OR result with illegal number of records */ - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. 
- p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name: ERR 4A0F: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } if ( num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: " - "Unmatched record number, Expeceted : %d, Got : %u.\n", + "osmt_get_service_by_name: ERR 4A10: " + "Unmatched number of records: expected:%d, received:%u\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Expected num of records is : %d, Found number of records : %u\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); if( context.result.p_result_madw != NULL ) { @@ -1002,7 +1060,7 @@ osmt_get_all_services_and_check_names( I { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Getting All Service Records\n"); + "Getting all service records\n"); } /* * Do a blocking query for this record in the subnet. 
@@ -1028,8 +1086,8 @@ osmt_get_all_services_and_check_names( I if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0371: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A12: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -1040,14 +1098,14 @@ osmt_get_all_services_and_check_names( I if (status != IB_INVALID_PARAMETER) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0372: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A13: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); } if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_all_services_and_check_names: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -1058,14 +1116,14 @@ osmt_get_all_services_and_check_names( I num_recs = context.result.result_cnt; osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Received %u records.\n", num_recs ); + "Received %u records\n", num_recs ); for( i = 0; i < num_recs; i++ ) { p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, i ); osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_VERBOSE); for ( j = 0; j < num_of_valid_names; j++) @@ -1091,8 +1149,8 @@ osmt_get_all_services_and_check_names( I if (p_checked_names[j] == 0) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0377: " - "Missing Valid Service Name:%s\n",p_valid_service_names_arr[j]); + "osmt_get_all_services_and_check_names: ERR 
4A14: " + "Missing valid service: name:%s\n",p_valid_service_names_arr[j]); status = IB_ERROR; goto Exit; } @@ -1124,7 +1182,7 @@ osmt_delete_service_by_name(IN osmtest_t osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_delete_service_by_name: " - "Trying to Delete Service: Name:%s.\n", + "Trying to Delete service name:%s\n", sr_name); cl_memclr( &svc_rec, sizeof( svc_rec ) ); @@ -1132,17 +1190,10 @@ osmt_delete_service_by_name(IN osmtest_t status = osmt_get_service_by_name(p_osmt, sr_name,rec_num, &svc_rec); if (status != IB_SUCCESS) { - if (IsServiceExist) osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 001 " - "Nothing to delete - failed to find service by name: %s \n", sr_name); - else - { - osm_log( &p_osmt->log, OSM_LOG_INFO, - "osmt_delete_service_by_name: " - "Record should not exist, i.e. BAD flow\n"); - status = IB_SUCCESS; - } + "osmt_delete_service_by_name: ERR 4A15: " + "Failed to get service: name:%s\n", + sr_name ); goto ExitNoDel; } @@ -1175,29 +1226,57 @@ osmt_delete_service_by_name(IN osmtest_t if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0373: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A16: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; - + if ( IsServiceExist ) + { + /* If IsServiceExist = 1 then we should succeed here */ if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0374: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A17: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: " - "Remote error = %s.\n", + "osmt_delete_service_by_name: ERR 4A18: " + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); } - goto Exit; + } + } + else + { + /* If IsServiceExist = 0 then we should fail here */ + if ( status == IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_delete_service_by_name: ERR 4A19: " + "Succeeded to delete service:%s which " + "shouldn't exist", + sr_name ); + status = IB_ERROR; + } + else + { + /* The deletion should have failed, since the service_name + shouldn't exist. */ + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: " + "IS EXPECTED ERROR ^^^^\n"); + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_delete_service_by_name: " + "Failed to delete service_name:%s\n", + sr_name ); + status = IB_SUCCESS; + } } Exit: @@ -1362,144 +1441,256 @@ osmt_run_service_records_flow( IN osmtes /* Let OpenSM handle it */ usleep(100); + /* Make sure service_name[0] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[0],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1A: " + "Fail to find service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[1] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[1],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1B: " + "Fail to find service: name:%s\n", + (char*)service_name[1] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1C: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - /* Try to get osmt.srvc.4 b4 (there should be 1 record) and after 10 sec - It should be deleted */ + /* Make sure service_name[3] exists. 
*/ + /* After 10 seconds the service should not exist: service_lease = 10 */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1D: " + "Fail to find service: name:%s\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } + sleep(10); + status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1E: " + "Found service: name:%s that should have been " + "deleted due to service lease expiring\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Check that for the current Service ID only one record exists */ + + /* Check that for service: id[5] only one record exists */ status = osmt_get_service_by_id(p_osmt, 1, cl_ntoh64(id[5]),&srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1F: " + "Found number of records!=1 for " + "service: id:0x%016" PRIx64 "\n", + id[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with invalid Service ID */ + /* Bad Flow of Get with invalid Service ID: id[7] */ status = osmt_get_service_by_id(p_osmt, 0,cl_ntoh64(id[7]),&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A20: " + "Found service: id:0x%016" PRIx64 + " that is invalid\n", + id[7] ); + status = IB_ERROR; goto Exit; } - /* Check that for correct name and ID we get record set b4 */ + + /* Check by both id and service name: id[0], service_name[0] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[0]), (char*)service_name[0], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A21: " + "Fail to find service: 
id:0x%016" PRIx64 + " name:%s\n", + id[0], + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + + /* Check by both id and service name: id[5], service_name[6] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[5]), (char*)service_name[6], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A22: " + "Fail to find service: id:0x%016" PRIx64 + " name:%s\n", + id[5], + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with valid name and invalid ID */ + /* Bad Flow of Get with invalid name(service_name[3]) and valid ID(id[0]) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[0]), (char*)service_name[3], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A23: " + "Found service: id:0x%016" PRIx64 + " name:%s which is an invalid service\n", + id[0], + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with unmatched name (exists but not with the following ID) and valid ID */ + + /* Bad Flow of Get with unmatched name(service_name[5]) and id(id[3]) (both valid) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[3]), (char*)service_name[5], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A24: " + "Found service: id:0x%016" PRIx64 + " name:%s which is an invalid service\n", + id[3], + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } + /* Bad Flow of Get with service name that doesn't exist (service_name[4]) */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[4],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A25: " + "Found service: name:%s that shouldn't exist\n", +
(char*)service_name[4] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow : Check that getting osmt.srvc.6 brings no records since another service has been updated with the same ID - osmt.srvc.7 */ + + /* Bad Flow : Check that getting service_name[5] brings no records since another service + has been updated with the same ID (service_name[6] */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[5],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A26: " + "Found service: name:%s which is an " + "invalid service\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Check that getting osmt.srvc.7 by name ONLY is valid since we do not support key&name association, also trusted queries */ + + /* Check that getting service_name[6] by name ONLY is valid, + since we do not support key&name association, also trusted queries */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[6],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A27: " + "Fail to find service: name:%s\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } /* Test Service Key */ cl_memclr(service_key,16*sizeof(uint8_t)); + + /* Check for service_name[5] with service_key=0 - the service shouldn't + exist with this name. */ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[5], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A28: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong name)\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + + /* Check for service_name[6] with service_key=0 - the service should + exist with different key. 
*/ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A29: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong service_key)\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + /* check for service_name[6] with the correct service_key */ for (i=0;i <= 15;i++) service_key[i]=i + 1; status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], - 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + 1, service_key, &srv_rec); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2A: " + "Fail to find service: name:%s with " + "correct service key\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } #ifdef VENDOR_RMPP_SUPPORT + /* These are the only service_names which are valid */ cl_memcpy(&service_valid_names[0],&service_name[0],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[1],&service_name[2],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[2],&service_name[6],sizeof(uint8_t)*64); @@ -1507,79 +1698,101 @@ osmt_run_service_records_flow( IN osmtes status = osmt_get_all_services_and_check_names(p_osmt,service_valid_names,3, &num_recs); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2B: " + "Fail to find all services that should exist\n" ); + status = IB_ERROR; goto Exit; } #endif + /* Delete service_name[0] */ status = osmt_delete_service_by_name(p_osmt,1, (char*)service_name[0],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2C: " + "Fail to delete service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure deletion of 
service_name[0] succeeded */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[0],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[0],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.1 \n"); + "osmt_run_service_records_flow: ERR 4A2D: " + "Found service: name:%s that was deleted\n", + (char*)service_name[0] ); status = IB_ERROR; goto Exit; } - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - - sleep(3); + /* Make sure service_name[1] doesn't exist (expired service lease) */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[1],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[1],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.2 \n"); + "osmt_run_service_records_flow: ERR 4A2E: " + "Found service: name:%s that should have expired\n", + (char*)service_name[1] ); status = IB_ERROR; goto Exit; } - else - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - status = IB_SUCCESS; - } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected to find service osmt.srvc.3\n" ); + "osmt_run_service_records_flow: ERR 4A2F: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[6],1); + + /* Bad Flow - try to delete non-existent service_name[5] */ + status = osmt_delete_service_by_name(p_osmt,0, + (char*)service_name[5],0); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + 
"osmt_run_service_records_flow: ERR 4A30: " + "Succeed to delete non-existent service: name:%s\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow */ - status = osmt_delete_service_by_name(p_osmt,0, - (char*)service_name[5],1); - if (status == IB_SUCCESS) + /* Delete service_name[2] */ + status = osmt_delete_service_by_name(p_osmt,1, + (char*)service_name[2],1); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A31: " + "Fail to delete service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } + /* Delete service_name[6] */ status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[2],1); + (char*)service_name[6],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A32: " + "Failed to delete service name:%s\n", + (char*)service_name[6] ); goto Exit; } Index: opensm/osm_sa_service_record.c =================================================================== --- opensm/osm_sa_service_record.c (revision 5403) +++ opensm/osm_sa_service_record.c (working copy) @@ -798,6 +798,13 @@ osm_sr_rcv_process_get_method( p_recvd_service_rec = (ib_service_record_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_dump_service_record( p_rcv->p_log, + p_recvd_service_rec, + OSM_LOG_DEBUG ); + } + cl_qlist_init(&sr_match_item.sr_list); sr_match_item.p_service_rec = p_recvd_service_rec; sr_match_item.comp_mask = p_sa_mad->comp_mask;
From yael at mellanox.co.il Wed Feb 15 02:35:42 2006
From: yael at mellanox.co.il (Yael Kalka)
Date: 15 Feb 2006 12:35:42 +0200
Subject: [openib-general] [PATCH] Opensm - osmt_service.c fixes - take #2
Message-ID: <5zslqkkiyp.fsf@mtl066.yok.mtl.com>

Hi Hal,

The following patch makes these changes to osmt_service.c:
1. Fixes cases where the flow exits with "TEST PASS" although the test actually fails.
2.
Added cleanup of all the services created at the end of the test. 3. Cosmetic cleanups of the code. Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmt_service.c =================================================================== --- osmtest/osmt_service.c (revision 5403) +++ osmtest/osmt_service.c (working copy) @@ -80,7 +80,7 @@ osmt_register_service( IN osmtest_t * co osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -140,8 +140,8 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A01: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -150,14 +150,14 @@ osmt_register_service( IN osmtest_t * co if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service: ERR 4A02: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -195,7 +195,7 @@ osmt_register_service_with_full_key ( IN osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_full_key: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -256,8 +256,8 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A03: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -280,7 +280,7 @@ osmt_register_service_with_full_key ( IN status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key:" - "Data mismatch in service_key.\n" + "Data mismatch in service_key\n" ); goto Exit; } @@ -288,14 +288,14 @@ osmt_register_service_with_full_key ( IN if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_full_key: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_full_key: ERR 4A04: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_full_key: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -337,7 +337,7 @@ osmt_register_service_with_data( IN osmt osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_register_service_with_data: " - "Registering Service: name:%s id:0x%" PRIx64 ".\n", + "Registering service: name:%s id:0x%" PRIx64 "\n", service_name, cl_ntoh64(service_id)); cl_memclr( &req, sizeof( req ) ); @@ -426,8 +426,8 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0303: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A05: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -436,14 +436,14 @@ osmt_register_service_with_data( IN osmt if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_register_service_with_data: ERR 0364: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_register_service_with_data: ERR 4A06: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); @@ -464,7 +464,7 @@ osmt_register_service_with_data( IN osmt status = IB_REMOTE_ERROR; osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_register_service_with_data: " - "Data mismatch in service_data8.\n" + "Data mismatch in service_data8\n" ); goto Exit; } @@ -499,7 +499,7 @@ osmt_get_service_by_id_and_name ( IN osm if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Getting Service Record by id 0x%016" PRIx64 " and name : %s\n", + "Getting service record: id:0x%016" PRIx64 " and name:%s\n", cl_ntoh64(sid),sr_name); /* @@ -509,26 +509,29 @@ osmt_get_service_by_id_and_name ( IN osm * * The query structures are locals. 
*/ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SNAME; @@ -539,62 +542,73 @@ osmt_get_service_by_id_and_name ( IN osm if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id_and_name: ERR 4A07: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id_and_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 
0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id_and_name: ERR 4A08: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } - if ( num_recs != rec_num ) + if ( rec_num && num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id_and_name: " - "Unmatched record number, Expeceted : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id_and_name: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -623,7 +637,7 @@ osmt_get_service_by_id ( IN osmtest_t * if(
osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Getting Service Record by id 0x%016" PRIx64 "\n", + "Getting service record: id:0x%016" PRIx64 "\n", cl_ntoh64(sid)); /* @@ -633,23 +647,26 @@ osmt_get_service_by_id ( IN osmtest_t * * * The query structures are locals. */ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - svc_rec.service_id = sid; - /* prepare the data used for this query */ + context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + svc_rec.service_id = sid; + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SID; @@ -660,62 +677,74 @@ osmt_get_service_by_id ( IN osmtest_t * if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_id: ERR 4A09: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 
0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_id: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_id: ERR 4A0A: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } - if ( num_recs != rec_num ) + if ( rec_num && num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_id: " - "Unmatched record number; Expected : %d, Got : %d.\n", + "osmt_get_service_by_id: ERR 4A0B: " + "Unmatched number of records: expected:%d received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_id: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); @@ -737,22 +766,25 @@ osmt_get_service_by_name_and_key ( IN os 
osmtest_req_context_t context; osmv_query_req_t req; ib_service_record_t svc_rec,*p_rec; - uint32_t num_recs = 0; + uint32_t num_recs = 0, i; osmv_user_query_t user; OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name_and_key ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) { - uint8_t i; + char buf_service_key[33]; + + sprintf(buf_service_key, + "0x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x", + skey[0], skey[1], skey[2], skey[3], skey[4], skey[5], skey[6], skey[7], + skey[8], skey[9], skey[10], skey[11], skey[12], skey[13], skey[14], + skey[15]); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Getting Service Record by Name and Key :%s.\n", - sr_name); - for (i=0 ; i<=15 ; i++) - osm_log( &p_osmt->log, OSM_LOG_VERBOSE, - "Service Key[%u] = %u\n", - i,skey[i]); + "Getting service record: name:%s and key:%s\n", + sr_name, buf_service_key ); } /* @@ -762,92 +794,108 @@ osmt_get_service_by_name_and_key ( IN os * * The query structures are locals. 
*/ - cl_memclr( &svc_rec, sizeof( svc_rec ) ); cl_memclr( &req, sizeof( req ) ); cl_memclr( &context, sizeof( context ) ); - cl_memclr( &user, sizeof( user ) ); - /* set the new service record fields */ - cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); - cl_memcpy(svc_rec.service_name, sr_name, - (strlen(sr_name)+1)*sizeof(char)); - /* prepare the data used for this query */ context.p_osmt = p_osmt; + + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_USER_DEFINED; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; - req.p_query_input = &user; req.sm_key = 0; + cl_memclr( &svc_rec, sizeof( svc_rec ) ); + cl_memclr( &user, sizeof( user ) ); + /* set the new service record fields */ + cl_memclr(svc_rec.service_name, sizeof(svc_rec.service_name)); + cl_memcpy(svc_rec.service_name, sr_name, + (strlen(sr_name)+1)*sizeof(char)); + for (i=0 ; i<=15 ; i++) + svc_rec.service_key[i] = skey[i]; + + req.p_query_input = &user; + user.method = IB_MAD_METHOD_GET; user.attr_id = IB_MAD_ATTR_SERVICE_RECORD; user.comp_mask = IB_SR_COMPMASK_SNAME | IB_SR_COMPMASK_SKEY; user.attr_offset = ib_get_attr_offset( sizeof( ib_service_record_t ) ); user.p_attr = &svc_rec; - status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name_and_key: ERR 4A0C: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name_and_key: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char 
mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. */ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name_and_key: ERR 4A0D: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } - if ( num_recs != rec_num ) + if ( rec_num && num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name_and_key: " - "Unmatched record number, Expected : %d, Got : %d.\n", + "Unmatched number of records: expected:%d, received:%d\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if ( num_recs ) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name_and_key: " - "Expected num of records is : %d, Found number of records : %d\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); + if( context.result.p_result_madw != NULL ) { osm_mad_pool_put( &p_osmt->mad_pool, 
context.result.p_result_madw ); @@ -874,12 +922,10 @@ osmt_get_service_by_name( IN osmtest_t * OSM_LOG_ENTER( &p_osmt->log, osmt_get_service_by_name ); if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) - { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Getting Service Record by Name:%s.\n", + "Getting service record: name:%s\n", sr_name); - } /* * Do a blocking query for this record in the subnet. @@ -893,78 +939,90 @@ osmt_get_service_by_name( IN osmtest_t * context.p_osmt = p_osmt; + /* prepare the data used for this query */ req.query_type = OSMV_QUERY_SVC_REC_BY_NAME; req.timeout_ms = p_osmt->opt.transaction_timeout; req.retry_cnt = p_osmt->opt.retry_count; req.flags = OSM_SA_FLAGS_SYNC; req.query_context = &context; req.pfn_query_cb = osmtest_query_res_cb; + req.sm_key = 0; + cl_memclr(service_name, sizeof(service_name)); cl_memcpy(service_name, sr_name, (strlen(sr_name)+1)*sizeof(char)); req.p_query_input = service_name; - req.sm_key = 0; status = osmv_query_sa( p_osmt->h_bind, &req ); if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0365: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_service_by_name: ERR 4A0E: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; + num_recs = context.result.result_cnt; if( status != IB_SUCCESS ) { - /* The context struct is not init OR result with illegal number of records */ - num_recs = 0; - if (status != IB_INVALID_PARAMETER) - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: ERR 0370: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); - } + char mad_stat_err[256]; + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, + then this is fine. 
*/ if( status == IB_REMOTE_ERROR ) + strcpy(mad_stat_err, ib_get_mad_status_str( + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); + else + strcpy(mad_stat_err, ib_get_err_str(status) ); + + if( status == IB_REMOTE_ERROR && + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && + rec_num == 0 ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_service_by_name: " - "Remote error = %s.\n", - ib_get_mad_status_str( osm_madw_get_mad_ptr - ( context.result. - p_result_madw ) ) ); - } - goto Exit; + "IS EXPECTED ERROR ^^^^\n"); + status = IB_SUCCESS; } else { - num_recs = context.result.result_cnt; + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_get_service_by_name: ERR 4A0F: " + "Query failed:%s (%s)\n", + ib_get_err_str(status), + mad_stat_err ); + goto Exit; + } } - if ( num_recs != rec_num ) + if ( rec_num && num_recs != rec_num ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_service_by_name: " - "Unmatched record number, Expeceted : %d, Got : %u.\n", + "osmt_get_service_by_name: ERR 4A10: " + "Unmatched number of records: expected:%d, received:%u\n", rec_num, num_recs); status = IB_REMOTE_ERROR; goto Exit; } p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); + *p_out_rec = *p_rec; + + if (num_recs) + { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", sr_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); - *p_out_rec = *p_rec; + } Exit: osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_service_by_name: " - "Expected num of records is : %d, Found number of records : %u\n", - rec_num,num_recs); + "Expected and found %d records\n", + rec_num ); if( context.result.p_result_madw != NULL ) { @@ -1002,7 +1060,7 @@ osmt_get_all_services_and_check_names( I { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Getting All Service 
Records\n"); + "Getting all service records\n"); } /* * Do a blocking query for this record in the subnet. @@ -1028,8 +1086,8 @@ osmt_get_all_services_and_check_names( I if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0371: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A12: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } @@ -1040,14 +1098,14 @@ osmt_get_all_services_and_check_names( I if (status != IB_INVALID_PARAMETER) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0372: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_get_all_services_and_check_names: ERR 4A13: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); } if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_get_all_services_and_check_names: " - "Remote error = %s.\n", + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. 
p_result_madw ) ) ); @@ -1058,14 +1116,14 @@ osmt_get_all_services_and_check_names( I num_recs = context.result.result_cnt; osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Received %u records.\n", num_recs ); + "Received %u records\n", num_recs ); for( i = 0; i < num_recs; i++ ) { p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, i ); osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_get_all_services_and_check_names: " - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", + "Found service record: name:%s id:0x%016" PRIx64 "\n", p_rec->service_name, cl_ntoh64(p_rec->service_id)); osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_VERBOSE); for ( j = 0; j < num_of_valid_names; j++) @@ -1091,8 +1149,8 @@ osmt_get_all_services_and_check_names( I if (p_checked_names[j] == 0) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_get_all_services_and_check_names: ERR 0377: " - "Missing Valid Service Name:%s\n",p_valid_service_names_arr[j]); + "osmt_get_all_services_and_check_names: ERR 4A14: " + "Missing valid service: name:%s\n",p_valid_service_names_arr[j]); status = IB_ERROR; goto Exit; } @@ -1124,7 +1182,7 @@ osmt_delete_service_by_name(IN osmtest_t osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_delete_service_by_name: " - "Trying to Delete Service: Name:%s.\n", + "Trying to Delete service name:%s\n", sr_name); cl_memclr( &svc_rec, sizeof( svc_rec ) ); @@ -1132,17 +1190,10 @@ osmt_delete_service_by_name(IN osmtest_t status = osmt_get_service_by_name(p_osmt, sr_name,rec_num, &svc_rec); if (status != IB_SUCCESS) { - if (IsServiceExist) osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 001 " - "Nothing to delete - failed to find service by name: %s \n", sr_name); - else - { - osm_log( &p_osmt->log, OSM_LOG_INFO, - "osmt_delete_service_by_name: " - "Record should not exist, i.e. 
BAD flow\n"); - status = IB_SUCCESS; - } + "osmt_delete_service_by_name: ERR 4A15: " + "Failed to get service: name:%s\n", + sr_name ); goto ExitNoDel; } @@ -1175,29 +1226,57 @@ osmt_delete_service_by_name(IN osmtest_t if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0373: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A16: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); goto Exit; } status = context.result.status; - + if ( IsServiceExist ) + { + /* If IsServiceExist = 1 then we should succeed here */ if( status != IB_SUCCESS ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: ERR 0374: " - "ib_query failed (%s).\n", ib_get_err_str( status ) ); + "osmt_delete_service_by_name: ERR 4A17: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); if( status == IB_REMOTE_ERROR ) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_delete_service_by_name: " - "Remote error = %s.\n", + "osmt_delete_service_by_name: ERR 4A18: " + "Remote error = %s\n", ib_get_mad_status_str( osm_madw_get_mad_ptr ( context.result. p_result_madw ) ) ); } - goto Exit; + } + } + else + { + /* If IsServiceExist = 0 then we should fail here */ + if ( status == IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_delete_service_by_name: ERR 4A19: " + "Succeeded to delete service:%s which " + "shouldn't exist", + sr_name ); + status = IB_ERROR; + } + else + { + /* The deletion should have failed, since the service_name + shouldn't exist. 
*/ + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: " + "IS EXPECTED ERROR ^^^^\n"); + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_delete_service_by_name: " + "Failed to delete service_name:%s\n", + sr_name ); + status = IB_SUCCESS; + } } Exit: @@ -1362,144 +1441,256 @@ osmt_run_service_records_flow( IN osmtes /* Let OpenSM handle it */ usleep(100); + /* Make sure service_name[0] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[0],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1A: " + "Fail to find service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[1] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[1],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1B: " + "Fail to find service: name:%s\n", + (char*)service_name[1] ); + status = IB_ERROR; goto Exit; } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1C: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - /* Try to get osmt.srvc.4 b4 (there should be 1 record) and after 10 sec - It should be deleted */ + /* Make sure service_name[3] exists. 
*/ + /* After 10 seconds the service should not exist: service_lease = 10 */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1D: " + "Fail to find service: name:%s\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } + sleep(10); + status = osmt_get_service_by_name(p_osmt, (char*)service_name[3],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1E: " + "Found service: name:%s that should have been " + "deleted due to service lease expiring\n", + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Check that for the current Service ID only one record exists */ + + /* Check that for service: id[5] only one record exists */ status = osmt_get_service_by_id(p_osmt, 1, cl_ntoh64(id[5]),&srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A1F: " + "Found number of records!=1 for " + "service: id:0x%016" PRIx64 "\n", + id[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with invalid Service ID */ + /* Bad Flow of Get with invalid Service ID: id[7] */ status = osmt_get_service_by_id(p_osmt, 0,cl_ntoh64(id[7]),&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A20: " + "Found service: id:0x%016" PRIx64 + " that is invalid\n", + id[7] ); + status = IB_ERROR; goto Exit; } - /* Check that for correct name and ID we get record set b4 */ + + /* Check by both id and service name: id[0], service_name[0] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[0]), (char*)service_name[0], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A21: " + "Fail to find service: 
id:0x%016 " PRIx64 + "name:%s\n", + id[0], + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + + /* Check by both id and service name: id[5], service_name[6] */ status = osmt_get_service_by_id_and_name(p_osmt, 1, cl_ntoh64(id[5]), (char*)service_name[6], &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A22: " + "Fail to find service: id:0x%016 " PRIx64 + "name:%s\n", + id[5], + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with valid name and invalid ID */ + /* Bad Flow of Get with invalid name(service_name[3]) and valid ID(id[0]) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[0]), (char*)service_name[3], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A23: " + "Found service: id:0x%016" PRIx64 + "name:%s which is an invalid service\n", + id[0], + (char*)service_name[3] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow of Get with unmatched name (exists but not with the following ID) and valid ID */ + + /* Bad Flow of Get with unmatched name(service_name[5]) and id(id[3]) (both valid) */ status = osmt_get_service_by_id_and_name(p_osmt, 0, cl_ntoh64(id[3]), (char*)service_name[5], &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A24: " + "Found service: id:0x%016" PRIx64 + "name:%s which is an invalid service\n", + id[3], + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } + /* Bad Flow of Get with service name that doesn't exist (service_name[4]) */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[4],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A25: " + "Found service: name:%s that shouldn't exist\n", + 
(char*)service_name[4] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow : Check that getting osmt.srvc.6 brings no records since another service has been updated with the same ID - osmt.srvc.7 */ + + /* Bad Flow : Check that getting service_name[5] brings no records since another service + has been updated with the same ID (service_name[6] */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[5],0, &srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A26: " + "Found service: name:%s which is an " + "invalid service\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Check that getting osmt.srvc.7 by name ONLY is valid since we do not support key&name association, also trusted queries */ + + /* Check that getting service_name[6] by name ONLY is valid, + since we do not support key&name association, also trusted queries */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[6],1, &srv_rec); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A27: " + "Fail to find service: name:%s\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } /* Test Service Key */ cl_memclr(service_key,16*sizeof(uint8_t)); + + /* Check for service_name[5] with service_key=0 - the service shouldn't + exist with this name. */ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[5], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A28: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong name)\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + + /* Check for service_name[6] with service_key=0 - the service should + exist with different key. 
*/ status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A29: " + "Found service: name:%s key:0 which is an " + "invalid service (wrong service_key)\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } + /* check for service_name[6] with the correct service_key */ for (i=0;i <= 15;i++) service_key[i]=i + 1; status = osmt_get_service_by_name_and_key (p_osmt, (char*)service_name[6], - 0, service_key,&srv_rec); - if (status == IB_SUCCESS) + 1, service_key, &srv_rec); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2A: " + "Fail to find service: name:%s with " + "correct service key\n", + (char*)service_name[6] ); + status = IB_ERROR; goto Exit; } - else - { - status = IB_SUCCESS; - } #ifdef VENDOR_RMPP_SUPPORT + /* These ar the only service_names which are valid */ cl_memcpy(&service_valid_names[0],&service_name[0],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[1],&service_name[2],sizeof(uint8_t)*64); cl_memcpy(&service_valid_names[2],&service_name[6],sizeof(uint8_t)*64); @@ -1507,79 +1698,101 @@ osmt_run_service_records_flow( IN osmtes status = osmt_get_all_services_and_check_names(p_osmt,service_valid_names,3, &num_recs); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2B: " + "Fail to find all services that should exist\n" ); + status = IB_ERROR; goto Exit; } #endif + /* Delete service_name[0] */ status = osmt_delete_service_by_name(p_osmt,1, (char*)service_name[0],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A2C: " + "Fail to delete service: name:%s\n", + (char*)service_name[0] ); + status = IB_ERROR; goto Exit; } + /* Make sure deletion of 
service_name[0] succeeded */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[0],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[0],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.1 \n"); + "osmt_run_service_records_flow: ERR 4A2D: " + "Found service: name:%s that was deleted\n", + (char*)service_name[0] ); status = IB_ERROR; goto Exit; } - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - - sleep(3); + /* Make sure service_name[1] doesn't exist (expired service lease) */ status = osmt_get_service_by_name(p_osmt, - (char*)service_name[1],1, &srv_rec); - if (status == IB_SUCCESS) + (char*)service_name[1],0, &srv_rec); + if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected not to find osmt.srvc.2 \n"); + "osmt_run_service_records_flow: ERR 4A2E: " + "Found service: name:%s that should have expired\n", + (char*)service_name[1] ); status = IB_ERROR; goto Exit; } - else - { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: " - "IS EXPECTED ERR ^^^^\n"); - status = IB_SUCCESS; - } + /* Make sure service_name[2] exists */ status = osmt_get_service_by_name(p_osmt, (char*)service_name[2],1, &srv_rec); if (status != IB_SUCCESS) { osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_service_records_flow: ERR 001 " - "Expected to find service osmt.srvc.3\n" ); + "osmt_run_service_records_flow: ERR 4A2F: " + "Fail to find service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } - status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[6],1); + + /* Bad Flow - try to delete non-existent service_name[5] */ + status = osmt_delete_service_by_name(p_osmt,0, + (char*)service_name[5],0); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + 
"osmt_run_service_records_flow: ERR 4A30: " + "Succeed to delete non-existent service: name:%s\n", + (char*)service_name[5] ); + status = IB_ERROR; goto Exit; } - /* Bad Flow */ - status = osmt_delete_service_by_name(p_osmt,0, - (char*)service_name[5],1); - if (status == IB_SUCCESS) + /* Delete service_name[2] */ + status = osmt_delete_service_by_name(p_osmt,1, + (char*)service_name[2],1); + if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A31: " + "Fail to delete service: name:%s\n", + (char*)service_name[2] ); + status = IB_ERROR; goto Exit; } + /* Delete service_name[6] */ status = osmt_delete_service_by_name(p_osmt,1, - (char*)service_name[2],1); + (char*)service_name[6],1); if (status != IB_SUCCESS) { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_service_records_flow: ERR 4A32: " + "Failed to delete service name:%s\n", + (char*)service_name[6] ); goto Exit; } Index: opensm/osm_sa_service_record.c =================================================================== --- opensm/osm_sa_service_record.c (revision 5403) +++ opensm/osm_sa_service_record.c (working copy) @@ -798,6 +798,13 @@ osm_sr_rcv_process_get_method( p_recvd_service_rec = (ib_service_record_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_dump_service_record( p_rcv->p_log, + p_recvd_service_rec, + OSM_LOG_DEBUG ); + } + cl_qlist_init(&sr_match_item.sr_list); sr_match_item.p_service_rec = p_recvd_service_rec; sr_match_item.comp_mask = p_sa_mad->comp_mask; From ogerlitz at voltaire.com Wed Feb 15 03:08:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 15 Feb 2006 13:08:51 +0200 (IST) Subject: [openib-general] [PATCH] iser: bugfix related to non start/end page aligned SG Message-ID: bugfix related to SGs whose first/last element are not start/end page aligned Signed-off-by: Or Gerlitz Index: iser_verbs.c 
=================================================================== --- iser_verbs.c (revision 5414) +++ iser_verbs.c (working copy) @@ -152,7 +152,9 @@ static int iser_create_ib_conn_res(struc p_iser_adaptor = p_iser_conn->p_adaptor; params.page_shift = PAGE_SHIFT; - params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE; + /* when the first/last SG element are not start/end * + * page aligned, the map whould be of N+1 pages */ + params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE + 1; params.pool_size = ISCSI_ISER_XMIT_CMDS_MAX; params.dirty_watermark = 32; params.cache = 0; From halr at voltaire.com Wed Feb 15 03:56:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 06:56:14 -0500 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1139964683.5941.10.camel@beast.terraplex.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> Message-ID: <1140004573.4333.19957.camel@hal.voltaire.com> Hi Owen, On Tue, 2006-02-14 at 19:51, Owen Stampflee wrote: > On Tue, 2006-02-14 at 19:25 -0500, Hal Rosenstock wrote: > > Hi Owen, > > > > On Tue, 2006-02-14 at 14:26, Owen Stampflee wrote: > > > [root at m1 ~]# opensm > > > ------------------------------------------------- > > > OpenSM Rev:openib-1.1.0 > > > Based on OpenIB svn 5411 > > > Command Line Arguments: > > > Log File: /var/log/osm.log > > > ------------------------------------------------- > > > *** glibc detected *** realloc(): invalid next size: 0x000000001007ae90 > > ^^^^^^^^^^^^^^^^^ > > That's a pretty big size (268,938,896). Any idea what was going on ? Hmm... I don't see where OpenSM directly calls realloc anywhere. Perhaps this is used under the covers in glibc. > > How far do you get with opensm ? Is there anything in the log ? > > > > Out of curiousity, how big is your subnet ? > Small, two nodes directly connected, I'll see if it works better with a switch in the middle. 
I doubt that adding a switch will change things. > > > *** > > > Aborted > > > > Is this reproducible ? > Yes, everytime opensm runs. > > Only thing is the log is this: > Feb 14 13:18:48 488484 [A9020] -> OpenSM Rev:openib-1.1.0 OpenIB svn > 5411 Can you strace it and provide the output ? Thanks. -- Hal > > > Installed components are: > > > * kernel-g5-smp-2.6.15-1.yhpc.1 > > > * gcc-3.4.4-2.ydl.2 > > > * glibc-2.3.4-2.13.ydl.0 > > > > Is this PowerPC ? > Yup, PowerMac G5s. > > > > * SBS mthca adapter > > > > > > Everything is built 64-bit, with -fPIC. Any idea on what could be going > > > on? > > > > The default build doesn't use -fPIC but I wouldn't think that should > > have anything to do with it. > I tried without -fPIC, also rebuild the Fedora FC5 RPMs (svn4265) to see > if that helped, but still no luck. I'm using svn5411. From yael at mellanox.co.il Wed Feb 15 04:41:42 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 15 Feb 2006 14:41:42 +0200 Subject: [openib-general] [PATCH] Opensm - osmtest.c - add windows support Message-ID: <5zr764kd4p.fsf@mtl066.yok.mtl.com> Hi Hal, The following patch adds includes changes to support windows compilation. The changes are of files to include, and a casting fix. 
Yael Signed-off-by: Yael Kalka Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 5403) +++ osmtest/osmtest.c (working copy) @@ -47,18 +47,28 @@ * $Revision: 1.10 $ */ +#ifdef __WIN__ +#pragma warning(disable : 4996) +#endif + /* next error code: 16A */ #include #include +#ifdef __WIN__ +#include +#include +#else #include +#include +#endif #include #include #include "osmtest.h" -#include - +#ifndef __WIN__ #define strnicmp strncasecmp +#endif #define POOL_MIN_ITEMS 64 #define GUID_ARRAY_SIZE 64 @@ -2653,7 +2663,7 @@ osmtest_stress_large_rmpp_pr( IN osmtest if (num_recs == 0) ratio = 0; else - ratio = num_queries / num_recs; + ratio = (float)(num_queries / num_recs); printf( "-I- Queries to Record Ratio is %" PRIu64 " records, %" PRIu64 " queries : %.2f \n", num_recs, num_queries, ratio); print_freq = 0; From halr at voltaire.com Wed Feb 15 05:21:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 08:21:42 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osmtest.c - add windows support In-Reply-To: <5zr764kd4p.fsf@mtl066.yok.mtl.com> References: <5zr764kd4p.fsf@mtl066.yok.mtl.com> Message-ID: <1140009701.4333.20616.camel@hal.voltaire.com> On Wed, 2006-02-15 at 07:41, Yael Kalka wrote: > Hi Hal, > > The following patch adds includes changes to support windows > compilation. The changes are of files to include, and a casting fix. > > Yael > > Signed-off-by: Yael Kalka Thanks. Applied. From eli at mellanox.co.il Wed Feb 15 05:31:33 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 15 Feb 2006 15:31:33 +0200 Subject: [openib-general] PATCH] mthca - command interface - revised Message-ID: <1140010293.4601.8.camel@mtls03.yok.mtl.com> Roland, this patch is modified according to your comments. It also adds a kernel configuration option which selects whether to use posting commands through doorbells. The option is off by default. 
Signed-off-by: Eli Cohen Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -117,9 +117,18 @@ enum { MTHCA_OPCODE_INVALID = 0xff }; +enum { + MTHCA_CMD_USE_EVENTS = (1 << 0), + MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1), + MTHCA_CMD_POST_DOORBELLS = (1 << 2) +}; + +enum { + MTHCA_CMD_NUM_DBELL_DWORDS = 8 +}; + struct mthca_cmd { struct pci_pool *pool; - int use_events; struct mutex hcr_mutex; struct semaphore poll_sem; struct semaphore event_sem; @@ -128,6 +137,10 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + u32 flags; + void __iomem *dbell_map; + u64 dbell_base; + u16 dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS]; }; struct mthca_limits { Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -188,19 +188,57 @@ static inline int go_bit(struct mthca_de swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) + +static void mthca_cmd_post_dbell(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token) { - int err = 0; + void __iomem *ptr = dev->cmd.dbell_map; + u16 *offs = dev->cmd.dbell_offsets; + + __raw_writel((__force u32) cpu_to_be32(in_param >> 32), + ptr + offs[0]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_param & 0xfffffffful), + ptr + offs[1]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_modifier), + ptr + offs[2]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param >> 32), + ptr + offs[3]); + wmb(); + 
__raw_writel((__force u32) cpu_to_be32(out_param & 0xfffffffful), + ptr + offs[4]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(token << 16), + ptr + offs[5]); + wmb(); + __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), + ptr + offs[6]); + wmb(); + __raw_writel((__force u32) 0, + ptr + offs[7]); + wmb(); +} - mutex_lock(&dev->cmd.hcr_mutex); +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -210,10 +248,8 @@ static int mthca_cmd_post(struct mthca_d } } - if (go_bit(dev)) { - err = -EAGAIN; - goto out; - } + if (go_bit(dev)) + return -EAGAIN; /* * We use writel (instead of something like memcpy_toio) @@ -236,7 +272,29 @@ static int mthca_cmd_post(struct mthca_d (op_modifier << HCR_OPMOD_SHIFT) | op), dev->hcr + 6 * 4); -out: + return 0; +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + mutex_lock(&dev->cmd.hcr_mutex); + + if (dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && event) + mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier, + op_modifier, op, token); + else + err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier, + op_modifier, op, token, event); + mutex_unlock(&dev->cmd.hcr_mutex); return err; } @@ -386,7 +444,7 @@ static int mthca_cmd_box(struct mthca_de unsigned long timeout, u8 *status) { - if (dev->cmd.use_events) + if (dev->cmd.flags & MTHCA_CMD_USE_EVENTS) return mthca_cmd_wait(dev, in_param, &out_param, 0, in_modifier, op_modifier, op, timeout, status); @@ -423,7 +481,7 @@ static int mthca_cmd_imm(struct mthca_de unsigned long timeout, u8 *status) { - if (dev->cmd.use_events) + if (dev->cmd.flags & MTHCA_CMD_USE_EVENTS) return mthca_cmd_wait(dev, 
in_param, out_param, 1, in_modifier, op_modifier, op, timeout, status); @@ -437,7 +495,7 @@ int mthca_cmd_init(struct mthca_dev *dev { mutex_init(&dev->cmd.hcr_mutex); sema_init(&dev->cmd.poll_sem, 1); - dev->cmd.use_events = 0; + dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS; dev->hcr = ioremap(pci_resource_start(dev->pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); @@ -461,6 +519,8 @@ void mthca_cmd_cleanup(struct mthca_dev { pci_pool_destroy(dev->cmd.pool); iounmap(dev->hcr); + if (dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS) + iounmap(dev->cmd.dbell_map); } /* @@ -498,12 +558,49 @@ int mthca_cmd_use_events(struct mthca_de ; /* nothing */ --dev->cmd.token_mask; - dev->cmd.use_events = 1; + dev->cmd.flags |= MTHCA_CMD_USE_EVENTS; + + down(&dev->cmd.poll_sem); return 0; } + +#ifdef CONFIG_INFINIBAND_MTHCA_CMD_UAR +/* + * attempt to post commands through doorbells + */ +int mthca_use_cmd_doorbells(struct mthca_dev *dev) +{ + int i; + u16 max_off = 0; + unsigned long pg1, pg2; + + if (!(dev->cmd.flags & MTHCA_CMD_CAN_POST_DOORBELLS)) + return -ENODEV; + + for (i=0; i<8; ++i) + if (dev->cmd.dbell_offsets[i] > max_off) + max_off = dev->cmd.dbell_offsets[i]; + + pg1 = dev->cmd.dbell_base & PAGE_MASK; + pg2 = (dev->cmd.dbell_base + max_off) & PAGE_MASK; + + if (pg1 != pg2) + return -ENOMEM; + + dev->cmd.dbell_map = ioremap(dev->cmd.dbell_base, max_off + sizeof(u32)); + if (!dev->cmd.dbell_map) + return -ENOMEM; + + dev->cmd.flags |= MTHCA_CMD_POST_DOORBELLS; + mthca_dbg(dev, "posting commands through doorbell\n"); + + return 0; +} +#endif + /* * Switch back to polling (used when shutting down the device) */ @@ -511,7 +608,7 @@ void mthca_cmd_use_polling(struct mthca_ { int i; - dev->cmd.use_events = 0; + dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS; for (i = 0; i < dev->cmd.max_cmds; ++i) down(&dev->cmd.event_sem); @@ -665,8 +762,10 @@ int mthca_QUERY_FW(struct mthca_dev *dev { struct mthca_mailbox *mailbox; u32 *outbox; + u32 tmp; int err = 0; u8 lg; + int i; #define QUERY_FW_OUT_SIZE
0x100 #define QUERY_FW_VER_OFFSET 0x00 @@ -674,6 +773,11 @@ int mthca_QUERY_FW(struct mthca_dev *dev #define QUERY_FW_ERR_START_OFFSET 0x30 #define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_CMD_DB_EN_OFFSET 0x10 +#define QUERY_FW_CMD_DB_OFFSET 0x50 +#define QUERY_FW_CMD_DB_BASE 0x60 + #define QUERY_FW_START_OFFSET 0x20 #define QUERY_FW_END_OFFSET 0x28 @@ -706,6 +810,15 @@ int mthca_QUERY_FW(struct mthca_dev *dev dev->cmd.max_cmds = 1 << lg; MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET); MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET); + MTHCA_GET(tmp, outbox, QUERY_FW_CMD_DB_EN_OFFSET); + if (tmp & 0x1) { + mthca_dbg(dev, "FW supports commands through doorbells\n"); + dev->cmd.flags |= MTHCA_CMD_CAN_POST_DOORBELLS; + } + MTHCA_GET(dev->cmd.dbell_base, outbox, QUERY_FW_CMD_DB_BASE); + for (i=0; i<MTHCA_CMD_NUM_DBELL_DWORDS; ++i) + MTHCA_GET(dev->cmd.dbell_offsets[i], outbox, + QUERY_FW_CMD_DB_OFFSET + (i << 1)); mthca_dbg(dev, "FW version %012llx, max commands %d\n", (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); @@ -1659,7 +1772,6 @@ int mthca_MODIFY_QP(struct mthca_dev *de if (0) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" opt param mask: %08x\n", be32_to_cpup(mailbox->buf)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) printk(" [%02x] ", i * 4); Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -244,6 +244,14 @@ struct mthca_set_ib_param { int mthca_cmd_init(struct mthca_dev *dev); void mthca_cmd_cleanup(struct mthca_dev *dev); int mthca_cmd_use_events(struct mthca_dev *dev); +#if defined(CONFIG_INFINIBAND_MTHCA_CMD_UAR) +int mthca_use_cmd_doorbells(struct mthca_dev *dev); +#else +static inline int mthca_use_cmd_doorbells(struct mthca_dev *dev) +{ + return -ENOSYS; +} +#endif void mthca_cmd_use_polling(struct mthca_dev *dev); void
mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c @@ -754,6 +754,11 @@ static int __devinit mthca_setup_hca(str goto err_eq_table_free; } + err = mthca_use_cmd_doorbells(dev); + if (err) + mthca_dbg(dev, "not using commands through doorbells\n"); + + err = mthca_NOP(dev, &status); if (err || status) { mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n", Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/Kconfig =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/Kconfig +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/Kconfig @@ -14,3 +14,12 @@ config INFINIBAND_MTHCA_DEBUG This option causes the mthca driver produce a bunch of debug messages. Select this is you are developing the driver or trying to diagnose a problem. + +config INFINIBAND_MTHCA_CMD_UAR + bool "Post commands through uar0" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option will check if the device supports issuing commands + by writing to the UAR area. This can result in better CPU + utilization. 
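
The doorbell setup in the patch above relies on two small checks that are easy to try in user space. What follows is a minimal sketch, not driver code: the flag values are copied from the patch's enum, but the helper names (can_post_doorbells, negation_applied_first, dbell_same_page) and the 4096-byte page size used in the example are assumptions made for illustration. The first pair of helpers shows why a capability test needs the mask in parentheses: in C, unary ! binds tighter than &, so negating first leaves only 0 or 1 to be masked. The third helper mirrors the pg1/pg2 comparison, which checks that the doorbell base and the largest doorbell offset fall in the same page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the enum the patch adds to mthca_dev.h. */
enum {
	CMD_USE_EVENTS         = 1 << 0,
	CMD_CAN_POST_DOORBELLS = 1 << 1,
	CMD_POST_DOORBELLS     = 1 << 2
};

/* Mask first, then test: nonzero only when the capability bit is set. */
static int can_post_doorbells(uint32_t flags)
{
	return (flags & CMD_CAN_POST_DOORBELLS) != 0;
}

/*
 * How "!flags & MASK" parses without extra parentheses: (!flags) is
 * 0 or 1, and neither value has bit 1 set, so the result is always 0.
 */
static int negation_applied_first(uint32_t flags)
{
	return (!flags) & CMD_CAN_POST_DOORBELLS;
}

/*
 * Mirror of the pg1/pg2 check: return 1 if base and base plus the
 * largest offset lie in the same page. page_size must be a power of two.
 */
static int dbell_same_page(uint64_t base, const uint16_t *offs, size_t n,
			   uint64_t page_size)
{
	uint64_t page_mask = ~(page_size - 1);
	uint16_t max_off = 0;
	size_t i;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	for (i = 0; i < n; ++i)
		if (offs[i] > max_off)
			max_off = offs[i];

	return (base & page_mask) == ((base + max_off) & page_mask);
}
```

With eight dword offsets 0, 4, ..., 28, a base of 0x1000 keeps everything in one 4096-byte page, while a base of 0x1ff0 puts the last offsets in the next page, which is the case the patch rejects.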
From glebn at voltaire.com Wed Feb 15 06:25:29 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 15 Feb 2006 16:25:29 +0200 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215101448.GJ12974@mellanox.co.il> References: <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> Message-ID: <20060215142529.GM24524@minantech.com> On Wed, Feb 15, 2006 at 12:14:48PM +0200, Michael S. Tsirkin wrote: > Quoting r. Gleb Natapov : > > > Clarification: as I see it, longer term we want to add a flag to make > > > get_user_pages trigger an immediate page copy on fork (rather than > > > copy_ptes). > > > > Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)? > > This should hopefully solve more than just the reg_mr issue, and not > specific to infiniband. See e.g. here: http://lkml.org/lkml/2005/12/12/30 > So no, this will have to be a per-page flag: set by get_user_pages when > passed some new option, and cleared by put_page when the page ref count > drops to page map count. > Yes this is very serious issue I wonder why aio users don't complain all over the lklm. (or should aio buffers have to be aligned?) > BTW, I dont know when I will get around to working on it, so any help > would be appreciated. Do you think new page flag is a viable solution? With the holy war against new (and old) page flags. Besides fork will have to go from pte to struct page to check flags for each mapped page in the process! > > > > > Remember that we should gracefully handle overlapping registrations. > > > > > Right, and madvise doesnt do any refcouting. That's one reason not to > > > include it in reg_mr. > > > > I beg to differ. 
I think this is exactly the reason to include it in > > reg_mr. Otherwise each application should reinvent refcounting logic. It > > is much better to do it right once instead of doing it wrong many times. > > Talking about applications developed directly for infiniband again? Is this a banned subject? Or is this not recommended for application programmers to work directly with verbs? > But why do you think they always use overlapping regions? > I don't know. They should not care about this mundane details. > > > Another is that madvise only works for full pages. > > > > Everything in VM works only for full pages. Unix don't try to hide this > > from user. > > ibv_reg_mr works fine for sub-page regions. Doesnt it? > Not really. It gives you the impression that it works by not returning an error and aligning address and lengths for you. Same case with mmap(). You can provide nonaligned length and it will not fail. > > > Applications should be aware of these limitations, and I think the easiest > > > way to achieve this is by asking them to use madvise directly. > > > > The problem not in madvice but in refcounting that each application must > > maintain. > > I dont really see a good way around this. Why not do it only once in the library that each RDMA application will have to use. -- Gleb. From mst at mellanox.co.il Wed Feb 15 06:36:22 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Feb 2006 16:36:22 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215142529.GM24524@minantech.com> References: <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> Message-ID: <20060215143622.GV12974@mellanox.co.il> Quoting r. Gleb Natapov : > > ibv_reg_mr works fine for sub-page regions. Doesnt it? > > > Not really. It gives you the impression that it works by not returning an > error and aligning address and lengths for you. Same case with mmap(). You > can provide nonaligned length and it will not fail. As far as I know hardware supports non-aligned regions and so does ibv_reg_mr: try to access outside the region and you'll get completion with error. No? If not, let me know - it should be easy to fix. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From glebn at voltaire.com Wed Feb 15 06:46:46 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 15 Feb 2006 16:46:46 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215143622.GV12974@mellanox.co.il> References: <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> Message-ID: <20060215144646.GN24524@minantech.com> On Wed, Feb 15, 2006 at 04:36:22PM +0200, Michael S. Tsirkin wrote: > Quoting r. 
Gleb Natapov : > > > ibv_reg_mr works fine for sub-page regions. Doesnt it? > > > > > Not really. It gives you the impression that it works by not returning an > > error and aligning address and lengths for you. Same case with mmap(). You > > can provide nonaligned length and it will not fail. > > As far as I know hardware supports non-aligned regions and so does ibv_reg_mr: > try to access outside the region and you'll get completion with error. > No? If not, let me know - it should be easy to fix. > Of cause you are right about reg_mr effect in regards to infiniband protocol. I was talked about the effect on VM subsystem. Any real program can't ignore this. I just want to move complexity to one place. -- Gleb. From mst at mellanox.co.il Wed Feb 15 06:52:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 16:52:51 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215144646.GN24524@minantech.com> References: <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> Message-ID: <20060215145251.GW12974@mellanox.co.il> Quoting r. Gleb Natapov : > > > > ibv_reg_mr works fine for sub-page regions. Doesnt it? > > > > > > Not really. It gives you the impression that it works by not returning an > > > error and aligning address and lengths for you. Same case with mmap(). You > > > can provide nonaligned length and it will not fail. > > > > As far as I know hardware supports non-aligned regions and so does > > ibv_reg_mr: try to access outside the region and you'll get completion with > > error. No? If not, let me know - it should be easy to fix. 
> > Of cause you are right about reg_mr effect in regards to infiniband protocol. > I was talked about the effect on VM subsystem. Any real program can't > ignore this. It cant? Why does it care? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From glebn at voltaire.com Wed Feb 15 06:54:08 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 15 Feb 2006 16:54:08 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215145251.GW12974@mellanox.co.il> References: <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> Message-ID: <20060215145408.GO24524@minantech.com> On Wed, Feb 15, 2006 at 04:52:51PM +0200, Michael S. Tsirkin wrote: > Quoting r. Gleb Natapov : > > > > > ibv_reg_mr works fine for sub-page regions. Doesnt it? > > > > > > > > Not really. It gives you the impression that it works by not returning an > > > > error and aligning address and lengths for you. Same case with mmap(). You > > > > can provide nonaligned length and it will not fail. > > > > > > As far as I know hardware supports non-aligned regions and so does > > > ibv_reg_mr: try to access outside the region and you'll get completion with > > > error. No? If not, let me know - it should be easy to fix. > > > > Of cause you are right about reg_mr effect in regards to infiniband protocol. > > I was talked about the effect on VM subsystem. Any real program can't > > ignore this. > > It cant? Why does it care? > Because the program should be careful to not put the data it needs in the same page with registered buffer. -- Gleb. 
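
To make the page-granularity point in this exchange concrete, here is a minimal sketch of the arithmetic involved. It assumes a 4096-byte page in the example and uses hypothetical helper names (touched_pages, shares_page); it is plain address arithmetic, not libibverbs or kernel code. The idea is that a per-page setting such as madvise(MADV_DONTFORK) covers whole pages, so a sub-page registered buffer also affects any other data that happens to sit on the pages it touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct page_span {
	uintptr_t start; /* first byte of the first page touched */
	uintptr_t end;   /* one past the last byte of the last page touched */
};

/*
 * Round [addr, addr + len) out to page boundaries.
 * page_size must be a power of two and len must be nonzero.
 */
static struct page_span touched_pages(uintptr_t addr, size_t len,
				      uintptr_t page_size)
{
	struct page_span s;
	uintptr_t mask = page_size - 1;

	assert(len > 0 && page_size != 0 && (page_size & mask) == 0);

	s.start = addr & ~mask;
	s.end = (addr + len + mask) & ~mask;
	return s;
}

/* Return 1 if 'other' lives on a page that the buffer touches. */
static int shares_page(uintptr_t buf, size_t len, uintptr_t other,
		       uintptr_t page_size)
{
	struct page_span s = touched_pages(buf, len, page_size);

	return other >= s.start && other < s.end;
}
```

For instance, a 32-byte buffer at 0x1010 and a variable at 0x1800 both sit on the page starting at 0x1000, so whatever is applied to that page on behalf of the buffer applies to the variable as well. A variable at 0x2000 is on the next page and is unaffected.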
From mst at mellanox.co.il Wed Feb 15 06:58:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 16:58:48 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215145408.GO24524@minantech.com> References: <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> Message-ID: <20060215145848.GX12974@mellanox.co.il> Quoting r. Gleb Natapov : > Subject: Re: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK > > On Wed, Feb 15, 2006 at 04:52:51PM +0200, Michael S. Tsirkin wrote: > > Quoting r. Gleb Natapov : > > > > > > ibv_reg_mr works fine for sub-page regions. Doesn't it? > > > > > > > > > > Not really. It gives you the impression that it works by not returning an > > > > > error and aligning address and lengths for you. Same case with mmap(). You > > > > > can provide nonaligned length and it will not fail. > > > > > > > > As far as I know hardware supports non-aligned regions and so does > > > > ibv_reg_mr: try to access outside the region and you'll get completion with > > > > error. No? If not, let me know - it should be easy to fix. > > > > > > Of course you are right about the reg_mr effect with regard to the InfiniBand protocol. > > > I was talking about the effect on the VM subsystem. Any real program can't > > > ignore this. > > > > It can't? Why does it care? > > Because the program should be careful not to put the data it needs in the > same page as a registered buffer. It should? Why should it? -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From glebn at voltaire.com Wed Feb 15 07:04:37 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 15 Feb 2006 17:04:37 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215145848.GX12974@mellanox.co.il> References: <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> <20060215145848.GX12974@mellanox.co.il> Message-ID: <20060215150437.GQ24524@minantech.com> On Wed, Feb 15, 2006 at 04:58:48PM +0200, Michael S. Tsirkin wrote: > Quoting r. Gleb Natapov : > > Subject: Re: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK > > > > On Wed, Feb 15, 2006 at 04:52:51PM +0200, Michael S. Tsirkin wrote: > > > Quoting r. Gleb Natapov : > > > > > > > ibv_reg_mr works fine for sub-page regions. Doesn't it? > > > > > > > > > > > > Not really. It gives you the impression that it works by not returning an > > > > > > error and aligning address and lengths for you. Same case with mmap(). You > > > > > > can provide nonaligned length and it will not fail. > > > > > > > > > > As far as I know hardware supports non-aligned regions and so does > > > > > ibv_reg_mr: try to access outside the region and you'll get completion with > > > > > error. No? If not, let me know - it should be easy to fix. > > > > > > > > Of course you are right about the reg_mr effect with regard to the InfiniBand protocol. > > > > I was talking about the effect on the VM subsystem. Any real program can't > > > > ignore this. > > > > > > It can't? Why does it care? > > > > Because the program should be careful not to put the data it needs in the > > same page as a registered buffer. > > It should? Why should it? 
Because after fork it may not find it. (But somehow I think you know that.) -- Gleb. From mst at mellanox.co.il Wed Feb 15 07:16:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 17:16:49 +0200 Subject: [openib-general] [PATCH] add asm-generic/mman.h Message-ID: <20060215151649.GA12090@mellanox.co.il> How does the following look (against gc3-git)? asm-alpha/mman.h | 8 +++++--- asm-arm/mman.h | 31 +------------------------------ asm-arm26/mman.h | 31 +------------------------------ asm-cris/mman.h | 31 +------------------------------ asm-frv/mman.h | 31 +------------------------------ asm-generic/mman.h | 37 +++++++++++++++++++++++++++++++++++++ asm-h8300/mman.h | 31 +------------------------------ asm-i386/mman.h | 31 +------------------------------ asm-ia64/mman.h | 31 +------------------------------ asm-m32r/mman.h | 33 ++------------------------------- asm-m68k/mman.h | 31 +------------------------------ asm-mips/mman.h | 22 ++++++++++++---------- asm-parisc/mman.h | 8 +++++--- asm-powerpc/mman.h | 32 ++------------------------------ asm-s390/mman.h | 31 +------------------------------ asm-sh/mman.h | 31 +------------------------------ asm-sparc/mman.h | 31 ++----------------------------- asm-sparc64/mman.h | 31 ++----------------------------- asm-v850/mman.h | 30 +----------------------------- asm-x86_64/mman.h | 30 +----------------------------- asm-xtensa/mman.h | 22 ++++++++++++---------- 21 files changed, 91 insertions(+), 503 deletions(-) Tested on x86_64. --- Make new MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK consistent across all arches. The idea is to make it possible to use them portably even before distros include them in libc headers. Move common flags to asm-generic/mman.h Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.16-rc3/include/asm-powerpc/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-powerpc/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-powerpc/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,6 +1,8 @@ #ifndef _ASM_POWERPC_MMAN_H #define _ASM_POWERPC_MMAN_H +#include + /* * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -8,19 +10,6 @@ * 2 of the License, or (at your option) any later version. */ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */ #define MAP_NORESERVE 0x40 /* don't reserve swap pages */ #define MAP_LOCKED 0x80 @@ -29,27 +18,10 @@ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MADV_NORMAL 0x0 /* 
default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* _ASM_POWERPC_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-cris/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-cris/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -3,19 +3,7 @@ /* verbatim copy of asm-i386/ version */ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -25,24 +13,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define 
MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __CRIS_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-arm26/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-arm26/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ARM_MMAN_H__ #define __ARM_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory 
asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ARM_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-alpha/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-alpha/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-alpha/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -42,9 +42,11 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ -#define MADV_REMOVE 7 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ + +/* common/generic parameters */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc3/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-m68k/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-m68k/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ 
#ifndef __M68K_MMAN_H__ #define __M68K_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __M68K_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-xtensa/mman.h 
2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-xtensa/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -67,17 +67,19 @@ #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_NORMAL 0 /* no further special treatment */ +#define MADV_RANDOM 1 /* expect random page references */ +#define MADV_SEQUENTIAL 2 /* expect sequential page references */ +#define MADV_WILLNEED 3 /* will need these pages */ +#define MADV_DONTNEED 4 /* don't need these pages */ + +/* common parameters: try to keep these consistent across architectures */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 +#define MAP_ANON MAP_ANONYMOUS +#define MAP_FILE 0 #endif /* _XTENSA_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-mips/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-mips/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-mips/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -60,17 +60,19 @@ #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 
/* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_NORMAL 0 /* no further special treatment */ +#define MADV_RANDOM 1 /* expect random page references */ +#define MADV_SEQUENTIAL 2 /* expect sequential page references */ +#define MADV_WILLNEED 3 /* will need these pages */ +#define MADV_DONTNEED 4 /* don't need these pages */ + +/* common parameters: try to keep these consistent across architectures */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 +#define MAP_ANON MAP_ANONYMOUS +#define MAP_FILE 0 #endif /* _ASM_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-sparc64/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-sparc64/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -2,21 +2,10 @@ #ifndef __SPARC64_MMAN_H__ #define __SPARC64_MMAN_H__ +#include + /* SunOS'ified... 
*/ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */ #define MAP_NORESERVE 0x40 /* don't reserve swap pages */ #define MAP_INHERIT 0x80 /* SunOS doesn't do this, but... */ @@ -27,10 +16,6 @@ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ @@ -48,18 +33,6 @@ #define MC_LOCKAS 5 /* Lock an entire address space of the calling process */ #define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ -#define MADV_REMOVE 0x6 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility 
flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 #endif /* __SPARC64_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-v850/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-v850/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,18 +1,7 @@ #ifndef __V850_MMAN_H__ #define __V850_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -20,24 +9,7 @@ #define MAP_LOCKED 0x2000 /* pages are locked */ #define MAP_NORESERVE 0x4000 /* don't check for reservations */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 
0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __V850_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-s390/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-s390/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -9,19 +9,7 @@ #ifndef __S390_MMAN_H__ #define __S390_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -31,24 +19,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 
0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __S390_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-parisc/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-parisc/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -38,7 +38,11 @@ #define MADV_SPACEAVAIL 5 /* insure that resources are reserved */ #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */ #define MADV_VPS_INHERIT 7 /* Inherit parents page size */ -#define MADV_REMOVE 8 /* remove these pages & resources */ + +/* common/generic parameters */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* The range 12-64 is reserved for page size specification. 
*/ #define MADV_4K_PAGES 12 /* Use 4K pages */ @@ -49,8 +53,6 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc3/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-i386/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-i386/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __I386_MMAN_H__ #define __I386_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in 
behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __I386_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-sh/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-sh/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ASM_SH_MMAN_H #define __ASM_SH_MMAN_H -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all 
current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ASM_SH_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-ia64/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-ia64/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -8,19 +8,7 @@ * David Mosberger-Tang , Hewlett-Packard Co */ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x00100 /* stack-like segment */ #define MAP_GROWSUP 0x00200 /* register stack-like segment */ @@ -31,24 +19,7 @@ #define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously 
*/
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif /* _ASM_IA64_MMAN_H */
Index: linux-2.6.16-rc3/include/asm-generic/mman.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc3/include/asm-generic/mman.h 2006-02-15 19:59:41.000000000 +0200
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_MMAN_H
+#define _ASM_GENERIC_MMAN_H
+
+/*
+ Author: Michael S. Tsirkin , Mellanox Technologies Ltd.
+ Based on: asm-xxx/mman.h
+*/
+
+#define PROT_READ 0x1 /* page can be read */
+#define PROT_WRITE 0x2 /* page can be written */
+#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
+#define PROT_NONE 0x0 /* page can not be accessed */
+#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
+#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+
+#define MAP_SHARED 0x01 /* Share changes */
+#define MAP_PRIVATE 0x02 /* Changes are private */
+#define MAP_TYPE 0x0f /* Mask for type of mapping */
+#define MAP_FIXED 0x10 /* Interpret addr exactly */
+#define MAP_ANONYMOUS 0x20 /* don't use a file */
+
+#define MS_ASYNC 1 /* sync memory asynchronously */
+#define MS_SYNC 2 /* synchronous memory sync */
+#define MS_INVALIDATE 4 /* invalidate the caches */
+
+#define MADV_NORMAL 0 /* no further special treatment */
+#define MADV_RANDOM 1 /* expect random page references */
+#define MADV_SEQUENTIAL 2 /* expect sequential page references */
+#define MADV_WILLNEED 3 /* will need these pages */
+#define MADV_DONTNEED 4 /* don't need these pages */
+
+/* common parameters: try to keep these consistent across architectures */
+#define MADV_REMOVE 9 /* remove these pages & resources */
+#define MADV_DONTFORK 10 /* don't inherit across fork */
+#define MADV_DOFORK 11 /* do inherit across fork */
+
+/* compatibility flags */
+#define MAP_ANON MAP_ANONYMOUS
+#define MAP_FILE 0
+
+#endif
Index: linux-2.6.16-rc3/include/asm-sparc/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-sparc/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-sparc/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -2,21 +2,10 @@
 #ifndef __SPARC_MMAN_H__
 #define __SPARC_MMAN_H__
+#include <asm-generic/mman.h>
+
 /* SunOS'ified...
*/
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
-
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
 #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */
 #define MAP_NORESERVE 0x40 /* don't reserve swap pages */
 #define MAP_INHERIT 0x80 /* SunOS doesn't do this, but... */
@@ -27,10 +16,6 @@
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
 #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
 #define MCL_FUTURE 0x4000 /* lock all additions to address space */
@@ -48,18 +33,6 @@
 #define MC_LOCKAS 5 /* Lock an entire address space of the calling process */
 #define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
 #define MADV_FREE 0x5 /* (Solaris) contents can be freed */
-#define MADV_REMOVE 0x6 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility
flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
 #endif /* __SPARC_MMAN_H__ */
Index: linux-2.6.16-rc3/include/asm-m32r/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-m32r/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-m32r/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,21 +1,9 @@
 #ifndef __M32R_MMAN_H__
 #define __M32R_MMAN_H__
-/* orig : i386 2.6.0-test6 */
-
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#include <asm-generic/mman.h>
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
+/* orig : i386 2.6.0-test6 */
 #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
@@ -25,24 +13,7 @@
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define
MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif /* __M32R_MMAN_H__ */
Index: linux-2.6.16-rc3/include/asm-frv/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-frv/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-frv/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,19 +1,7 @@
 #ifndef __ASM_MMAN_H__
 #define __ASM_MMAN_H__
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
-
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
+#include <asm-generic/mman.h>
 #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
@@ -23,25 +11,8 @@
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead
aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif /* __ASM_MMAN_H__ */
Index: linux-2.6.16-rc3/include/asm-h8300/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-h8300/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-h8300/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,19 +1,7 @@
 #ifndef __H8300_MMAN_H__
 #define __H8300_MMAN_H__
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
-
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
+#include <asm-generic/mman.h>
 #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
@@ -23,24 +11,7 @@
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /*
default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif /* __H8300_MMAN_H__ */
Index: linux-2.6.16-rc3/include/asm-arm/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-arm/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-arm/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,19 +1,7 @@
 #ifndef __ARM_MMAN_H__
 #define __ARM_MMAN_H__
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
-
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
+#include <asm-generic/mman.h>
 #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
@@ -23,24 +11,7 @@
 #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define
MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif /* __ARM_MMAN_H__ */
Index: linux-2.6.16-rc3/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-x86_64/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-x86_64/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,19 +1,8 @@
 #ifndef __X8664_MMAN_H__
 #define __X8664_MMAN_H__
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_SEM 0x8
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#include <asm-generic/mman.h>
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
 #define MAP_32BIT 0x40 /* only give out 32bit addresses */
 #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
@@ -24,24 +13,7 @@
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define
MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
-#define MADV_NORMAL 0x0 /* default page-in behavior */
-#define MADV_RANDOM 0x1 /* page-in minimum required */
-#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
-#define MADV_WILLNEED 0x3 /* pre-fault pages */
-#define MADV_DONTNEED 0x4 /* discard these pages */
-#define MADV_REMOVE 0x5 /* remove these pages & resources */
-#define MADV_DONTFORK 0x30 /* dont inherit across fork */
-#define MADV_DOFORK 0x31 /* do inherit across fork */
-
-/* compatibility flags */
-#define MAP_ANON MAP_ANONYMOUS
-#define MAP_FILE 0
-
 #endif

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Wed Feb 15 07:18:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Feb 2006 17:18:42 +0200
Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215150437.GQ24524@minantech.com>
References: <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> <20060215145848.GX12974@mellanox.co.il> <20060215150437.GQ24524@minantech.com>
Message-ID: <20060215151842.GY12974@mellanox.co.il>

Quoting r. Gleb Natapov :
> > > Because the program should be careful to not put the data it needs in the
> > > same page with registered buffer.
> >
> > It should? Why should it?
>
> Because after fork it may not find it. (But somehow I think you know
> that.)

I don't. The private data will get COWed properly - it's only with DMA
data that we have a problem. No?

-- 
Michael S.
Tsirkin
Staff Engineer, Mellanox Technologies

From glebn at voltaire.com Wed Feb 15 07:27:12 2006
From: glebn at voltaire.com (Gleb Natapov)
Date: Wed, 15 Feb 2006 17:27:12 +0200
Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215151842.GY12974@mellanox.co.il>
References: <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> <20060215145848.GX12974@mellanox.co.il> <20060215150437.GQ24524@minantech.com> <20060215151842.GY12974@mellanox.co.il>
Message-ID: <20060215152711.GR24524@minantech.com>

On Wed, Feb 15, 2006 at 05:18:42PM +0200, Michael S. Tsirkin wrote:
> Quoting r. Gleb Natapov :
> > > > Because the program should be careful to not put the data it needs in the
> > > > same page with registered buffer.
> > >
> > > It should? Why should it?
> >
> > Because after fork it may not find it. (But somehow I think you know
> > that.)
>
> I don't. The private data will get COWed properly - it's only with DMA
> data that we have a problem. No?
>
Suppose you have this code:

char buf[1000];
char *prog="/bin/true";


main()
{
    reg_mr (buf, 1000);
    madvise (buf, 1000, DONTCOPY);
    system (prog);
}

if buf and prog are on the same page (most certainly) "/bin/true" will
never run.

--
			Gleb.

From mst at mellanox.co.il Wed Feb 15 07:33:07 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 15 Feb 2006 17:33:07 +0200
Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215152711.GR24524@minantech.com>
References: <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> <20060215145848.GX12974@mellanox.co.il> <20060215150437.GQ24524@minantech.com> <20060215151842.GY12974@mellanox.co.il> <20060215152711.GR24524@minantech.com>
Message-ID: <20060215153307.GA12974@mellanox.co.il>

Quoting r. Gleb Natapov :
> Suppose you have this code:
>
> char buf[1000];
> char *prog="/bin/true";
>
>
> main()
> {
>     reg_mr (buf, 1000);
>     madvise (buf, 1000, DONTCOPY);
>     system (prog);
> }
>
> if buf and prog are on the same page (most certainly) "/bin/true" will
> never run.

Right, so if you hide madvise inside reg_mr you create a problem.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From arkady at netapp.com Wed Feb 15 07:34:18 2006
From: arkady at netapp.com (Arkady Kanevsky)
Date: Wed, 15 Feb 2006 10:34:18 -0500
Subject: [PATCH] Re: [openib-general] cache.c
In-Reply-To: 
References: 
Message-ID: <200602151034.18716.arkady@netapp.com>

Agreed. Pointer is a pointer.
Here is a patch.

svn diff --diff-cmd "/usr/bin/diff" -x -up cache.c
Index: cache.c
===================================================================
--- cache.c (revision 5407)
+++ cache.c (working copy)
@@ -3,6 +3,7 @@
  * Copyright (c) 2005 Intel Corporation. All rights reserved.
  * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2006 Network Appliance, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.
You may choose to be licensed under the terms of the GNU
@@ -302,7 +303,7 @@ static void ib_cache_setup_one(struct ib
         kmalloc(sizeof *device->cache.pkey_cache *
             (end_port(device) - start_port(device) + 1), GFP_KERNEL);
     device->cache.gid_cache =
-        kmalloc(sizeof *device->cache.pkey_cache *
+        kmalloc(sizeof *device->cache.gid_cache *
             (end_port(device) - start_port(device) + 1), GFP_KERNEL);
     if (!device->cache.pkey_cache || !device->cache.gid_cache) {

*******************************************
Arkady

On Tuesday 14 February 2006 09:44 am, Roland Dreier wrote:
> Arkady> Roland, in core/cache.c
>
> Arkady> should
> device-> cache.gid_cache =
> Arkady> kmalloc(sizeof *device->cache.pkey_cache *
> Arkady> (end_port(device) - start_port(device) + 1), GFP_KERNEL);
>
> Arkady> be
>
> device-> cache.gid_cache =
> Arkady> kmalloc(sizeof *device->cache.gid_cache *
> Arkady> (end_port(device) - start_port(device) + 1), GFP_KERNEL);
>
> Yes, I guess so. It makes no practical difference since all pointers
> are always going to be the same size, but we might as well get it
> right. Care to send a patch?
>
> Thanks,
> Roland

From glebn at voltaire.com Wed Feb 15 07:40:00 2006
From: glebn at voltaire.com (Gleb Natapov)
Date: Wed, 15 Feb 2006 17:40:00 +0200
Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK
In-Reply-To: <20060215153307.GA12974@mellanox.co.il>
References: <20060215142529.GM24524@minantech.com> <20060215143622.GV12974@mellanox.co.il> <20060215144646.GN24524@minantech.com> <20060215145251.GW12974@mellanox.co.il> <20060215145408.GO24524@minantech.com> <20060215145848.GX12974@mellanox.co.il> <20060215150437.GQ24524@minantech.com> <20060215151842.GY12974@mellanox.co.il> <20060215152711.GR24524@minantech.com> <20060215153307.GA12974@mellanox.co.il>
Message-ID: <20060215154000.GT24524@minantech.com>

On Wed, Feb 15, 2006 at 05:33:07PM +0200, Michael S. Tsirkin wrote:
> Quoting r.
Gleb Natapov :
> > Suppose you have this code:
> >
> > char buf[1000];
> > char *prog="/bin/true";
> >
> >
> > main()
> > {
> >     reg_mr (buf, 1000);
> >     madvise (buf, 1000, DONTCOPY);
> >     system (prog);
> > }
> >
> > if buf and prog are on the same page (most certainly) "/bin/true" will
> > never run.
>
> Right, so if you hide madvise inside reg_mr you create a problem.
>
The problem exists no matter who calls madvise, and unfortunately the
programmer should be aware of it. Making each RDMA developer reimplement
refcounting only to be sure he is aware of the problem is cruel IMHO :)

--
			Gleb.

From pbadari at us.ibm.com Wed Feb 15 07:58:08 2006
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: Wed, 15 Feb 2006 07:58:08 -0800
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <20060215151649.GA12090@mellanox.co.il>
References: <20060215151649.GA12090@mellanox.co.il>
Message-ID: <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com>

On Wed, 2006-02-15 at 17:16 +0200, Michael S. Tsirkin wrote:
> How does the following look (against gc3-git)?

I tried to do the same earlier (while doing MADV_REMOVE) and got
ugly (I was trying to completely get rid of asm-specific ones),
so I gave up.

Anyway,

> Index: linux-2.6.16-rc3/include/asm-generic/mman.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.16-rc3/include/asm-generic/mman.h 2006-02-15 19:59:41.000000000 +0200
..
> +#define MS_ASYNC 1 /* sync memory asynchronously */
> +#define MS_SYNC 2 /* synchronous memory sync */
> +#define MS_INVALIDATE 4 /* invalidate the caches */

Shouldn't this be ?

+#define MS_ASYNC 1 /* sync memory asynchronously */
+#define MS_INVALIDATE 2 /* invalidate the caches */
+#define MS_SYNC 4 /* synchronous memory sync */

Thanks,
Badari

From ostampflee at terrasoftsolutions.com Wed Feb 15 08:43:08 2006
From: ostampflee at terrasoftsolutions.com (Owen Stampflee)
Date: Wed, 15 Feb 2006 08:43:08 -0800
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140004573.4333.19957.camel@hal.voltaire.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com>
Message-ID: <1140021789.21679.2.camel@beast.terraplex.com>

> Can you strace it and provide the output ? Thanks.
>
> -- Hal

http://cvs.terraplex.com/~owen/opensm.strace

From mohitka at hcl.in Wed Feb 15 08:45:23 2006
From: mohitka at hcl.in (Mohit Katiyar, Noida)
Date: Wed, 15 Feb 2006 22:15:23 +0530
Subject: [openib-general] iSER Doc
Message-ID: <3E6BB9CEE261E2428AD25D0D553DC49702F7674D@HSDLNTD1110010.noida.hcltech.com>

Hi,

I am looking for iSER API documentation. Is anybody aware of any such
documentation?

Thanks and Regards
Mohit Katiyar
HCL Technologies

From dhowells at redhat.com Wed Feb 15 08:47:23 2006
From: dhowells at redhat.com (David Howells)
Date: Wed, 15 Feb 2006 16:47:23 +0000
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <20060215151649.GA12090@mellanox.co.il>
References: <20060215151649.GA12090@mellanox.co.il>
Message-ID: <30034.1140022043@warthog.cambridge.redhat.com>

Michael S. Tsirkin wrote:

> How does the following look (against gc3-git)?

Fine by me.

Acked-By: David Howells

From mst at mellanox.co.il Wed Feb 15 08:50:16 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 15 Feb 2006 18:50:16 +0200
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com>
References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com>
Message-ID: <20060215165016.GD12974@mellanox.co.il>

Quoting r. Badari Pulavarty :
> Subject: Re: [PATCH] add asm-generic/mman.h
>
> On Wed, 2006-02-15 at 17:16 +0200, Michael S. Tsirkin wrote:
> > How does the following look (against gc3-git)?
>
> I tried to do the same earlier (while doing MADV_REMOVE) and got
> ugly (I was trying to completely get rid of asm-specific ones),
> so I gave up.
>
> Anyway,
>
> > Index: linux-2.6.16-rc3/include/asm-generic/mman.h
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6.16-rc3/include/asm-generic/mman.h 2006-02-15 19:59:41.000000000 +0200
> ..
> > +#define MS_ASYNC 1 /* sync memory asynchronously */
> > +#define MS_SYNC 2 /* synchronous memory sync */
> > +#define MS_INVALIDATE 4 /* invalidate the caches */
>
> Shouldn't this be ?
>
> +#define MS_ASYNC 1 /* sync memory asynchronously */
> +#define MS_INVALIDATE 2 /* invalidate the caches */
> +#define MS_SYNC 4 /* synchronous memory sync */
>
> Thanks,
> Badari
>

Note that this only looks misaligned in the patch. When you apply, +
disappears and numbers get aligned.
Other stuff in asm-xx/mman.h is aligned by tabs and not by spaces,
so why should these options be aligned by spaces?

-- 
Michael S.
Tsirkin
Staff Engineer, Mellanox Technologies

From pbadari at us.ibm.com Wed Feb 15 08:52:57 2006
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: Wed, 15 Feb 2006 08:52:57 -0800
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <20060215165016.GD12974@mellanox.co.il>
References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il>
Message-ID: <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com>

On Wed, 2006-02-15 at 18:50 +0200, Michael S. Tsirkin wrote:
> Quoting r. Badari Pulavarty :
> > Subject: Re: [PATCH] add asm-generic/mman.h
> >
> > On Wed, 2006-02-15 at 17:16 +0200, Michael S. Tsirkin wrote:
> > > How does the following look (against gc3-git)?
> >
> > I tried to do the same earlier (while doing MADV_REMOVE) and got
> > ugly (I was trying to completely get rid of asm-specific ones),
> > so I gave up.
> >
> > Anyway,
> >
> > > Index: linux-2.6.16-rc3/include/asm-generic/mman.h
> > > ===================================================================
> > > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > > +++ linux-2.6.16-rc3/include/asm-generic/mman.h 2006-02-15 19:59:41.000000000 +0200
> > ..
> > > +#define MS_ASYNC 1 /* sync memory asynchronously */
> > > +#define MS_SYNC 2 /* synchronous memory sync */
> > > +#define MS_INVALIDATE 4 /* invalidate the caches */
> >
> > Shouldn't this be ?
> >
> > +#define MS_ASYNC 1 /* sync memory asynchronously */
> > +#define MS_INVALIDATE 2 /* invalidate the caches */
> > +#define MS_SYNC 4 /* synchronous memory sync */
> >
> > Thanks,
> > Badari
> >
>
> Note that this only looks misaligned in the patch. When you apply, +
> disappears and numbers get aligned.
> Other stuff in asm-xx/mman.h is aligned by tabs and not by spaces,
> so why should these options be aligned by spaces?

I am not talking about alignment or spaces. What I meant was ..

MS_SYNC should be 4
MS_INVALIDATE should be 2

You got it the other way around in your patch.

Thanks,
Badari

From torvalds at osdl.org Wed Feb 15 09:01:48 2006
From: torvalds at osdl.org (Linus Torvalds)
Date: Wed, 15 Feb 2006 09:01:48 -0800 (PST)
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <20060215151649.GA12090@mellanox.co.il>
References: <20060215151649.GA12090@mellanox.co.il>
Message-ID: 

On Wed, 15 Feb 2006, Michael S. Tsirkin wrote:
>
> How does the following look (against gc3-git)?

NACK!

This changes the values, and the values are visible to user space.
Different architectures really _do_ have different values, even if (a)
it's sad and unnecessary and (b) 99% of all apps will never use these
values and thus never care.

You've changed MS_INVALIDATE from 2 to 4 here.

		Linus

From mst at mellanox.co.il Wed Feb 15 09:09:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Feb 2006 19:09:35 +0200
Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h
In-Reply-To: <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com>
References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com>
Message-ID: <20060215170935.GE12974@mellanox.co.il>

Quoting r. Badari Pulavarty :
> MS_SYNC should be 4
> MS_INVALIDATE should be 2

Good catch, thanks!
Other numbers look right, don't they?

---

Make new MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK consistent across all arches.
The idea is to make it possible to use them portably even before distros
include them in libc headers.

Move common flags to asm-generic/mman.h

Signed-off-by: Michael S.
Tsirkin

Index: linux-2.6.16-rc3/include/asm-powerpc/mman.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-powerpc/mman.h 2006-02-15 18:59:13.000000000 +0200
+++ linux-2.6.16-rc3/include/asm-powerpc/mman.h 2006-02-15 19:01:32.000000000 +0200
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_MMAN_H
 #define _ASM_POWERPC_MMAN_H
+#include <asm-generic/mman.h>
+
 /*
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -8,19 +10,6 @@
  * 2 of the License, or (at your option) any later version.
  */
-#define PROT_READ 0x1 /* page can be read */
-#define PROT_WRITE 0x2 /* page can be written */
-#define PROT_EXEC 0x4 /* page can be executed */
-#define PROT_SEM 0x8 /* page may be used for atomic ops */
-#define PROT_NONE 0x0 /* page can not be accessed */
-#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
-#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
-
-#define MAP_SHARED 0x01 /* Share changes */
-#define MAP_PRIVATE 0x02 /* Changes are private */
-#define MAP_TYPE 0x0f /* Mask for type of mapping */
-#define MAP_FIXED 0x10 /* Interpret addr exactly */
-#define MAP_ANONYMOUS 0x20 /* don't use a file */
 #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */
 #define MAP_NORESERVE 0x40 /* don't reserve swap pages */
 #define MAP_LOCKED 0x80
@@ -29,27 +18,10 @@
 #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
 #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
-#define MS_ASYNC 1 /* sync memory asynchronously */
-#define MS_INVALIDATE 2 /* invalidate the caches */
-#define MS_SYNC 4 /* synchronous memory sync */
-
 #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
 #define MCL_FUTURE 0x4000 /* lock all additions to address space */
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
-#define MADV_NORMAL 0x0 /*
default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* _ASM_POWERPC_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-cris/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-cris/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-cris/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -3,19 +3,7 @@ /* verbatim copy of asm-i386/ version */ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -25,24 +13,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define 
MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __CRIS_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-arm26/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-arm26/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-arm26/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ARM_MMAN_H__ #define __ARM_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory 
asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ARM_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-alpha/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-alpha/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-alpha/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -42,9 +42,11 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ -#define MADV_REMOVE 7 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ + +/* common/generic parameters */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc3/include/asm-m68k/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-m68k/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-m68k/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ 
#ifndef __M68K_MMAN_H__ #define __M68K_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __M68K_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-xtensa/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-xtensa/mman.h 
2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-xtensa/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -67,17 +67,19 @@ #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_NORMAL 0 /* no further special treatment */ +#define MADV_RANDOM 1 /* expect random page references */ +#define MADV_SEQUENTIAL 2 /* expect sequential page references */ +#define MADV_WILLNEED 3 /* will need these pages */ +#define MADV_DONTNEED 4 /* don't need these pages */ + +/* common parameters: try to keep these consistent across architectures */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 +#define MAP_ANON MAP_ANONYMOUS +#define MAP_FILE 0 #endif /* _XTENSA_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-mips/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-mips/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-mips/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -60,17 +60,19 @@ #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 
/* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ +#define MADV_NORMAL 0 /* no further special treatment */ +#define MADV_RANDOM 1 /* expect random page references */ +#define MADV_SEQUENTIAL 2 /* expect sequential page references */ +#define MADV_WILLNEED 3 /* will need these pages */ +#define MADV_DONTNEED 4 /* don't need these pages */ + +/* common parameters: try to keep these consistent across architectures */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 +#define MAP_ANON MAP_ANONYMOUS +#define MAP_FILE 0 #endif /* _ASM_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-sparc64/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-sparc64/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-sparc64/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -2,21 +2,10 @@ #ifndef __SPARC64_MMAN_H__ #define __SPARC64_MMAN_H__ +#include + /* SunOS'ified... 
*/ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */ #define MAP_NORESERVE 0x40 /* don't reserve swap pages */ #define MAP_INHERIT 0x80 /* SunOS doesn't do this, but... */ @@ -27,10 +16,6 @@ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ @@ -48,18 +33,6 @@ #define MC_LOCKAS 5 /* Lock an entire address space of the calling process */ #define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ -#define MADV_REMOVE 0x6 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility 
flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 #endif /* __SPARC64_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-v850/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-v850/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-v850/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,18 +1,7 @@ #ifndef __V850_MMAN_H__ #define __V850_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -20,24 +9,7 @@ #define MAP_LOCKED 0x2000 /* pages are locked */ #define MAP_NORESERVE 0x4000 /* don't check for reservations */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 
0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __V850_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-s390/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-s390/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-s390/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -9,19 +9,7 @@ #ifndef __S390_MMAN_H__ #define __S390_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -31,24 +19,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 
0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __S390_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-parisc/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-parisc/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-parisc/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -38,7 +38,11 @@ #define MADV_SPACEAVAIL 5 /* insure that resources are reserved */ #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */ #define MADV_VPS_INHERIT 7 /* Inherit parents page size */ -#define MADV_REMOVE 8 /* remove these pages & resources */ + +/* common/generic parameters */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ /* The range 12-64 is reserved for page size specification. 
*/ #define MADV_4K_PAGES 12 /* Use 4K pages */ @@ -49,8 +53,6 @@ #define MADV_4M_PAGES 22 /* Use 4 Megabyte pages */ #define MADV_16M_PAGES 24 /* Use 16 Megabyte pages */ #define MADV_64M_PAGES 26 /* Use 64 Megabyte pages */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ /* compatibility flags */ #define MAP_ANON MAP_ANONYMOUS Index: linux-2.6.16-rc3/include/asm-i386/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-i386/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-i386/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __I386_MMAN_H__ #define __I386_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in 
behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __I386_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-sh/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-sh/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-sh/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ASM_SH_MMAN_H #define __ASM_SH_MMAN_H -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all 
current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ASM_SH_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-ia64/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-ia64/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-ia64/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -8,19 +8,7 @@ * David Mosberger-Tang , Hewlett-Packard Co */ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x00100 /* stack-like segment */ #define MAP_GROWSUP 0x00200 /* register stack-like segment */ @@ -31,24 +19,7 @@ #define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously 
*/ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* _ASM_IA64_MMAN_H */ Index: linux-2.6.16-rc3/include/asm-generic/mman.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc3/include/asm-generic/mman.h 2006-02-15 21:52:25.000000000 +0200 @@ -0,0 +1,42 @@ +#ifndef _ASM_GENERIC_MMAN_H +#define _ASM_GENERIC_MMAN_H + +/* + Author: Michael S. Tsirkin , Mellanox Technologies Ltd. 
+ Based on: asm-xxx/mman.h +*/ + +#define PROT_READ 0x1 /* page can be read */ +#define PROT_WRITE 0x2 /* page can be written */ +#define PROT_EXEC 0x4 /* page can be executed */ +#define PROT_SEM 0x8 /* page may be used for atomic ops */ +#define PROT_NONE 0x0 /* page can not be accessed */ +#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ +#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ + +#define MAP_SHARED 0x01 /* Share changes */ +#define MAP_PRIVATE 0x02 /* Changes are private */ +#define MAP_TYPE 0x0f /* Mask for type of mapping */ +#define MAP_FIXED 0x10 /* Interpret addr exactly */ +#define MAP_ANONYMOUS 0x20 /* don't use a file */ + +#define MS_ASYNC 1 /* sync memory asynchronously */ +#define MS_INVALIDATE 2 /* invalidate the caches */ +#define MS_SYNC 4 /* synchronous memory sync */ + +#define MADV_NORMAL 0 /* no further special treatment */ +#define MADV_RANDOM 1 /* expect random page references */ +#define MADV_SEQUENTIAL 2 /* expect sequential page references */ +#define MADV_WILLNEED 3 /* will need these pages */ +#define MADV_DONTNEED 4 /* don't need these pages */ + +/* common parameters: try to keep these consistent across architectures */ +#define MADV_REMOVE 9 /* remove these pages & resources */ +#define MADV_DONTFORK 10 /* don't inherit across fork */ +#define MADV_DOFORK 11 /* do inherit across fork */ + +/* compatibility flags */ +#define MAP_ANON MAP_ANONYMOUS +#define MAP_FILE 0 + +#endif Index: linux-2.6.16-rc3/include/asm-sparc/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-sparc/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-sparc/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -2,21 +2,10 @@ #ifndef __SPARC_MMAN_H__ #define __SPARC_MMAN_H__ +#include + /* SunOS'ified... 
*/ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ #define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */ #define MAP_NORESERVE 0x40 /* don't reserve swap pages */ #define MAP_INHERIT 0x80 /* SunOS doesn't do this, but... */ @@ -27,10 +16,6 @@ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ @@ -48,18 +33,6 @@ #define MC_LOCKAS 5 /* Lock an entire address space of the calling process */ #define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ #define MADV_FREE 0x5 /* (Solaris) contents can be freed */ -#define MADV_REMOVE 0x6 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility 
flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 #endif /* __SPARC_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-m32r/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-m32r/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-m32r/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,21 +1,9 @@ #ifndef __M32R_MMAN_H__ #define __M32R_MMAN_H__ -/* orig : i386 2.6.0-test6 */ - -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ +#include -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +/* orig : i386 2.6.0-test6 */ #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -25,24 +13,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define 
MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __M32R_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-frv/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-frv/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-frv/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ASM_MMAN_H__ #define __ASM_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,25 +11,8 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead 
aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ASM_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-h8300/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-h8300/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-h8300/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __H8300_MMAN_H__ #define __H8300_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* 
default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __H8300_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-arm/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-arm/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-arm/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,7 @@ #ifndef __ARM_MMAN_H__ #define __ARM_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ +#include #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ @@ -23,24 +11,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) page tables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define 
MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif /* __ARM_MMAN_H__ */ Index: linux-2.6.16-rc3/include/asm-x86_64/mman.h =================================================================== --- linux-2.6.16-rc3.orig/include/asm-x86_64/mman.h 2006-02-15 18:59:13.000000000 +0200 +++ linux-2.6.16-rc3/include/asm-x86_64/mman.h 2006-02-15 19:01:32.000000000 +0200 @@ -1,19 +1,8 @@ #ifndef __X8664_MMAN_H__ #define __X8664_MMAN_H__ -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_SEM 0x8 -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ +#include -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ #define MAP_32BIT 0x40 /* only give out 32bit addresses */ #define MAP_GROWSDOWN 0x0100 /* stack-like segment */ @@ -24,24 +13,7 @@ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x10000 /* do not block on IO */ -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define 
MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - #define MCL_CURRENT 1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ -#define MADV_REMOVE 0x5 /* remove these pages & resources */ -#define MADV_DONTFORK 0x30 /* dont inherit across fork */ -#define MADV_DOFORK 0x31 /* do inherit across fork */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - #endif -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 15 09:11:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 19:11:01 +0200 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: References: <20060215151649.GA12090@mellanox.co.il> Message-ID: <20060215171101.GF12974@mellanox.co.il> Quoting r. Linus Torvalds : > You've changed MS_INVALIDATE from 2 to 4 here. It was just a typo, sorry. I've just resent it with MS_INVALIDATE fixed to 4. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From pbadari at us.ibm.com Wed Feb 15 09:17:13 2006 From: pbadari at us.ibm.com (Badari Pulavarty) Date: Wed, 15 Feb 2006 09:17:13 -0800 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: <20060215170935.GE12974@mellanox.co.il> References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> Message-ID: <1140023833.21448.12.camel@dyn9047017100.beaverton.ibm.com> On Wed, 2006-02-15 at 19:09 +0200, Michael S. Tsirkin wrote: > Quoting r. 
Badari Pulavarty : > > MS_SYNC should be 4 > > MS_INVALIDATE should be 2 > > Good catch, thanks! > Other numbers look right, dont they? > Yes. Others look good. BTW, can we decide on this quickly ? This patch actually changes MADV_REMOVE value - which was added recently, so no app should be using it, yet. But a major distro release is basing on current mainline (2.6.16ish). I hate to see distro release having differnt values for MADV_ from mainline - distro's will be stuck maintaining that crap for ever :( Ack-by: Badari Pulavarty Thanks, Badari From halr at voltaire.com Wed Feb 15 09:17:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 12:17:23 -0500 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140021789.21679.2.camel@beast.terraplex.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> Message-ID: <1140023843.4333.21715.camel@hal.voltaire.com> Hi Owen, On Wed, 2006-02-15 at 11:43, Owen Stampflee wrote: > > Can you strace it and provide the output ? Thanks. > > > > -- Hal > http://cvs.terraplex.com/~owen/opensm.strace I can see the initial write to send a MAD here and it fails after that. One more try: can you send an osm.log from opensm -V ? Thanks again. -- Hal From torvalds at osdl.org Wed Feb 15 09:28:30 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 15 Feb 2006 09:28:30 -0800 (PST) Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: <20060215170935.GE12974@mellanox.co.il> References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> Message-ID: On Wed, 15 Feb 2006, Michael S. 
Tsirkin wrote: > > Other numbers look right, dont they? Suggestion: for each macro name, do grep "macroname" patch and if you see anything that looks even half-way suspicious, check it. Here's a pipeline from hell which shows that you broke at least MADV_REMOVE (which has values 5-9 depending on architecture). sed -n '/^[-+].*define[ ]*/ { s/.*define[ ]*\([A-Za-z_0-9]*\).*/\1/ ; p}' patch | sort -u | while read i do echo $i: grep "^[-+].*$i" patch done | less -S Rule #1: use tools instead of eyeballs whenever you possibly can. Humans are bad at noticing changes like this. Linus From mst at mellanox.co.il Wed Feb 15 09:39:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 19:39:48 +0200 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> Message-ID: <20060215173948.GJ12974@mellanox.co.il> Quoting r. Linus Torvalds : > Subject: Re: [PATCH] add asm-generic/mman.h > > > > On Wed, 15 Feb 2006, Michael S. Tsirkin wrote: > > > > Other numbers look right, dont they? > > Suggestion: for each macro name, do > > grep "macroname" patch > > and if you see anything that looks even half-way suspicious, check it. > > Here's a pipeline from hell which shows that you broke at least > MADV_REMOVE (which has values 5-9 depending on architecture). > > sed -n '/^[-+].*define[ ]*/ > { s/.*define[ ]*\([A-Za-z_0-9]*\).*/\1/ ; p}' > patch | > sort -u | > while read i > do > echo $i: > grep "^[-+].*$i" patch > done | > less -S This change was intentional: MADV_REMOVE wasnt in any mainline kernels: it was added recently, so no app should be using it, yet. > Rule #1: use tools instead of eyeballs whenever you possibly can. Humans > are bad at noticing changes like this. Right. I'll go check again. 
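
The "tools instead of eyeballs" check can also be approximated in C. The sketch below is not from the thread: parse_define_line is a hypothetical helper that pulls the macro name and numeric value out of one added or removed #define line of a unified diff, so that old and new values can be compared mechanically. It assumes each define sits on a single line and has a plain numeric value; symbolic values such as MAP_ANONYMOUS are skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one unified-diff line of the form
 *     [+-]#define NAME VALUE [comment]
 * On success, store NAME (at most 63 characters) and the numeric VALUE,
 * and return the leading '+' or '-'.
 * Return 0 for anything else: context lines, file headers such as
 * "+++ path", #include lines, and defines whose value is not a number.
 * The function name and interface are illustrative assumptions.
 */
int parse_define_line(const char *line, char name[64], unsigned long *value)
{
    char sign;
    char val[64];
    char *end;

    assert(line != NULL && name != NULL && value != NULL);

    sign = line[0];
    if (sign != '+' && sign != '-')
        return 0;

    /* Skip the sign, then expect "#define", a name and a value token. */
    if (sscanf(line + 1, " #define %63s %63s", name, val) != 2)
        return 0;

    /* Base 0 accepts decimal, 0x-prefixed hex and leading-0 octal. */
    *value = strtoul(val, &end, 0);
    if (end == val || *end != '\0')
        return 0;

    return sign;
}
```

Feeding every line of a patch through such a helper and comparing the '-' value with the '+' value for each name would flag changes like the MS_INVALIDATE one discussed above.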
-- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From pbadari at us.ibm.com Wed Feb 15 09:40:50 2006 From: pbadari at us.ibm.com (Badari Pulavarty) Date: Wed, 15 Feb 2006 09:40:50 -0800 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> Message-ID: <1140025250.21448.15.camel@dyn9047017100.beaverton.ibm.com> On Wed, 2006-02-15 at 09:28 -0800, Linus Torvalds wrote: > > On Wed, 15 Feb 2006, Michael S. Tsirkin wrote: > > > > Other numbers look right, dont they? > > Suggestion: for each macro name, do > > grep "macroname" patch > > and if you see anything that looks even half-way suspicious, check it. > > Here's a pipeline from hell which shows that you broke at least > MADV_REMOVE (which has values 5-9 depending on architecture). Yes. I did that earlier and checked everything. MADV_REMOVE is a known change. Since it added very recently, I guess is okay to fix it for real. But if we are going to change it, I am hoping to see it very soon in mainline. (Before distros fork-off). Thanks, Badari From ostampflee at terrasoftsolutions.com Wed Feb 15 09:41:23 2006 From: ostampflee at terrasoftsolutions.com (Owen Stampflee) Date: Wed, 15 Feb 2006 09:41:23 -0800 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140023843.4333.21715.camel@hal.voltaire.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> Message-ID: <1140025284.22080.2.camel@beast.terraplex.com> This doesnt help much... at all... no new info to report. 
[root at m1 ~]# rm /var/log/osm.log [root at m1 ~]# opensm -v ------------------------------------------------- OpenSM Rev:openib-1.1.0 Command Line Arguments: Verbose option -v (log flags = 0x7) Log File: /var/log/osm.log ------------------------------------------------- *** glibc detected *** realloc(): invalid next size: 0x0000000010085e90 *** Aborted [root at m1 ~]# cat /var/log/osm.log Feb 15 10:39:12 161337 [16940] -> OpenSM Rev:openib-1.1.0 [root at m1 ~]# same log output with -V too. On Wed, 2006-02-15 at 12:17 -0500, Hal Rosenstock wrote: > Hi Owen, > > On Wed, 2006-02-15 at 11:43, Owen Stampflee wrote: > > > Can you strace it and provide the output ? Thanks. > > > > > > -- Hal > > http://cvs.terraplex.com/~owen/opensm.strace > > > I can see the initial write to send a MAD here and it fails after that. > One more try: can you send an osm.log from opensm -V ? Thanks again. > > -- Hal From mst at mellanox.co.il Wed Feb 15 09:51:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 19:51:18 +0200 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> Message-ID: <20060215175118.GK12974@mellanox.co.il> Quoting r. Linus Torvalds : > Rule #1: use tools instead of eyeballs whenever you possibly can. Humans > are bad at noticing changes like this.
OK, this #!/usr/bin/perl my %new_values; my @old_names; my @old_values; while (<>) { if (m/^\+#define\s+(\S+)\s+(\S+)\s+.*$/) { $new_values{$1}=$2; } if (m/^\-#define\s+(\S+)\s+(\S+)\s+.*$/) { push @old_names, $1; push @old_values, $2; } } for (my $i = 0; $i <= $#old_values; $i++) { if (oct($new_values{$old_names[$i]}) != oct($old_values[$i])) { print "Changed: $old_names[$i] $old_values[$i] " . " to $new_values{$old_names[$i]}\n"; } } Shows that the only numbers changed are MADV_REMOVE MADV_DONTFORK MADV_DOFORK As was intended. OK now? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Wed Feb 15 09:59:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 12:59:47 -0500 Subject: [openib-general] Re: [PATCH] Opensm - osmt_service.c fixes - take #2 In-Reply-To: <5zslqkkiyp.fsf@mtl066.yok.mtl.com> References: <5zslqkkiyp.fsf@mtl066.yok.mtl.com> Message-ID: <1140024818.4333.21822.camel@hal.voltaire.com> On Wed, 2006-02-15 at 05:35, Yael Kalka wrote: > Hi Hal, > > The following patch add the following to osmt_service.c: > 1. Currently, the flow sometimes exits with "TEST PATH" although the test > actually fails. > 2. Added cleanup of all the services created at the end of the test. > 3. Cosmetic cleanups of the code. > > Thanks, > Yael > > Signed-off-by: Yael Kalka Thanks. I applied this in 2 pieces (osm_sa_service_record.c and then osmt_service.c). I fixed a couple of minor nits noted below. > Index: osmtest/osmt_service.c > =================================================================== > --- osmtest/osmt_service.c (revision 5403) > +++ osmtest/osmt_service.c (working copy) [snip...] 
> @@ -539,62 +542,73 @@ osmt_get_service_by_id_and_name ( IN osm > if( status != IB_SUCCESS ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > - "osmt_get_service_by_id_and_name: ERR 0365: " > - "ib_query failed (%s).\n", ib_get_err_str( status ) ); > + "osmt_get_service_by_id_and_name: ERR 4A07: " > + "ib_query failed (%s)\n", ib_get_err_str( status ) ); > goto Exit; > } > > status = context.result.status; > + num_recs = context.result.result_cnt; > > if( status != IB_SUCCESS ) > { > - num_recs = 0; > - if (status != IB_INVALID_PARAMETER) > - { > - osm_log( &p_osmt->log, OSM_LOG_ERROR, > - "osmt_get_service_by_id_and_name: ERR 0370: " > - "ib_query failed (%s).\n", ib_get_err_str( status ) ); > - } > + char mad_stat_err[256]; > + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, > + then this is fine. */ > if( status == IB_REMOTE_ERROR ) > + strcpy(mad_stat_err, ib_get_mad_status_str( > + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); > + else > + strcpy(mad_stat_err, ib_get_err_str(status) ); > + if( status == IB_REMOTE_ERROR && > + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && > + rec_num == 0 ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_get_service_by_id_and_name: " > - "Remote error = %s.\n", > - ib_get_mad_status_str( osm_madw_get_mad_ptr > - ( context.result. 
> - p_result_madw ) ) ); > - } > - goto Exit; > + "IS EXPECTED ERROR ^^^^\n"); > + status = IB_SUCCESS; > } > else > { > - num_recs = context.result.result_cnt; > + osm_log( &p_osmt->log, OSM_LOG_ERROR, > + "osmt_get_service_by_id_and_name: ERR 4A08: " > + "Query failed:%s (%s)\n", > + ib_get_err_str(status), > + mad_stat_err ); > + goto Exit; > + } > } > > - if ( num_recs != rec_num ) > + if ( rec_num && num_recs != rec_num ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_get_service_by_id_and_name: " > - "Unmatched record number, Expeceted : %d, Got : %d.\n", > + "Unmatched number of records: expeceted:%d, received:%d\n", > rec_num, num_recs); > status = IB_REMOTE_ERROR; > goto Exit; > } > > p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); > + *p_out_rec = *p_rec; > + > + if (num_recs) > + { > osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > "osmt_get_service_by_id_and_name: " > - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", > + "Found service record: name:%s id:0x%016" PRIx64 "\n", > p_rec->service_name, cl_ntoh64(p_rec->service_id)); > > osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); > - *p_out_rec = *p_rec; > + } > > Exit: > osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > "osmt_get_service_by_id_and_name: " > - "Expected num of records is : %d, Found number of records : %d\n", > - rec_num,num_recs); > + "Expected and found $d records\n", ^^^ %d > + rec_num ); > + > if( context.result.p_result_madw != NULL ) > { > osm_mad_pool_put( &p_osmt->mad_pool, context.result.p_result_madw ); [snip...] 
> @@ -893,78 +939,90 @@ osmt_get_service_by_name( IN osmtest_t * > > context.p_osmt = p_osmt; > > + /* prepare the data used for this query */ > req.query_type = OSMV_QUERY_SVC_REC_BY_NAME; > req.timeout_ms = p_osmt->opt.transaction_timeout; > req.retry_cnt = p_osmt->opt.retry_count; > req.flags = OSM_SA_FLAGS_SYNC; > req.query_context = &context; > req.pfn_query_cb = osmtest_query_res_cb; > + req.sm_key = 0; > + > cl_memclr(service_name, sizeof(service_name)); > cl_memcpy(service_name, sr_name, (strlen(sr_name)+1)*sizeof(char)); > req.p_query_input = service_name; > - req.sm_key = 0; > > status = osmv_query_sa( p_osmt->h_bind, &req ); > if( status != IB_SUCCESS ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > - "osmt_get_service_by_name: ERR 0365: " > - "ib_query failed (%s).\n", ib_get_err_str( status ) ); > + "osmt_get_service_by_name: ERR 4A0E: " > + "ib_query failed (%s)\n", ib_get_err_str( status ) ); > goto Exit; > } > > status = context.result.status; > + num_recs = context.result.result_cnt; > > if( status != IB_SUCCESS ) > { > - /* The context struct is not init OR result with illegal number of records */ > - num_recs = 0; > - if (status != IB_INVALID_PARAMETER) > - { > - osm_log( &p_osmt->log, OSM_LOG_ERROR, > - "osmt_get_service_by_name: ERR 0370: " > - "ib_query failed (%s).\n", ib_get_err_str( status ) ); > - } > + char mad_stat_err[256]; > + /* If the failure is due to IB_SA_MAD_STATUS_NO_RECORDS and rec_num is 0, > + then this is fine. */ > if( status == IB_REMOTE_ERROR ) > + strcpy(mad_stat_err, ib_get_mad_status_str( > + osm_madw_get_mad_ptr(context.result.p_result_madw) ) ); > + else > + strcpy(mad_stat_err, ib_get_err_str(status) ); > + > + if( status == IB_REMOTE_ERROR && > + !strcmp(mad_stat_err, "IB_SA_MAD_STATUS_NO_RECORDS") && > + rec_num == 0 ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_get_service_by_name: " > - "Remote error = %s.\n", > - ib_get_mad_status_str( osm_madw_get_mad_ptr > - ( context.result. 
> - p_result_madw ) ) ); > - } > - goto Exit; > + "IS EXPECTED ERROR ^^^^\n"); > + status = IB_SUCCESS; > } > else > { > - num_recs = context.result.result_cnt; > + osm_log( &p_osmt->log, OSM_LOG_ERROR, > + "osmt_get_service_by_name: ERR 4A0F: " > + "Query failed:%s (%s)\n", > + ib_get_err_str(status), > + mad_stat_err ); > + goto Exit; > + } > } > > - if ( num_recs != rec_num ) > + if ( rec_num && num_recs != rec_num ) > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > - "osmt_get_service_by_name: " > - "Unmatched record number, Expeceted : %d, Got : %u.\n", > + "osmt_get_service_by_name: ERR 4A10: " > + "Unmatched number of records: expected:%d, received:%u\n", > rec_num, num_recs); > status = IB_REMOTE_ERROR; > goto Exit; > } > > p_rec = osmv_get_query_svc_rec( context.result.p_result_madw, 0 ); > + *p_out_rec = *p_rec; > + > + if (num_recs) > + { > osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > "osmt_get_service_by_name: " > - "Found Service Record by Name:%s ID:0x%016" PRIx64 ".\n", > + "Found service record: name:%s id:0x%016" PRIx64 "\n", > sr_name, cl_ntoh64(p_rec->service_id)); > > osm_dump_service_record(&p_osmt->log, p_rec, OSM_LOG_DEBUG); > - *p_out_rec = *p_rec; > + } > > Exit: > osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > "osmt_get_service_by_name: " > - "Expected num of records is : %d, Found number of records : %u\n", ^^ %d > - rec_num,num_recs); > + "Expected and found %d records\n", > + rec_num ); > > if( context.result.p_result_madw != NULL ) > { > @@ -1002,7 +1060,7 @@ osmt_get_all_services_and_check_names( I > { > osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > "osmt_get_all_services_and_check_names: " > - "Getting All Service Records\n"); > + "Getting all service records\n"); > } > /* > * Do a blocking query for this record in the subnet. [snip...] From mst at mellanox.co.il Wed Feb 15 10:15:36 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Feb 2006 20:15:36 +0200 Subject: [openib-general] Re: [PATCH] add asm-generic/mman.h In-Reply-To: <1140025250.21448.15.camel@dyn9047017100.beaverton.ibm.com> References: <20060215151649.GA12090@mellanox.co.il> <1140019088.21448.3.camel@dyn9047017100.beaverton.ibm.com> <20060215165016.GD12974@mellanox.co.il> <1140022377.21448.6.camel@dyn9047017100.beaverton.ibm.com> <20060215170935.GE12974@mellanox.co.il> <1140025250.21448.15.camel@dyn9047017100.beaverton.ibm.com> Message-ID: <20060215181536.GN12974@mellanox.co.il> Quoting r. Badari Pulavarty : > Yes. I did that earlier and checked everything. > > MADV_REMOVE is a known change. Since it added very recently, > I guess is okay to fix it for real. But if we are going to > change it, I am hoping to see it very soon in mainline. > (Before distros fork-off). I agree. Andrew, could take this into -mm? This replaces Roland's fix-up-madv_dontfork-madv_dofork-definitions.patch -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From danb at voltaire.com Wed Feb 15 10:30:41 2006 From: danb at voltaire.com (Dan Bar Dov) Date: Wed, 15 Feb 2006 20:30:41 +0200 Subject: [openib-general] iSER Doc Message-ID: I'm not sure what iser API you refer to. The openIB iser API to the open-iscsi iscsi_transport module is identical to the iscsi_tcp API. The datamover API is not used. Beyond that, you can read the iser .h files. Dan > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Mohit > Katiyar, Noida > Sent: Wednesday, February 15, 2006 6:45 PM > To: openib-general at openib.org > Subject: [openib-general] iSER Doc > > Hi, > I am looking for iSER API documentation. Is anybody aware of such an > info?? 
> > Thanks and Regards > Mohit Katiyar > HCL Technologies > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Feb 15 10:47:11 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Feb 2006 20:47:11 +0200 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140025284.22080.2.camel@beast.terraplex.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> Message-ID: <20060215184711.GE12172@sashak.voltaire.com> On 09:41 Wed 15 Feb , Owen Stampflee wrote: > This doesnt help much... at all... no new info to report. > > [root at m1 ~]# rm /var/log/osm.log > [root at m1 ~]# opensm -v > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > Command Line Arguments: > Verbose option -v (log flags = 0x7) > Log File: /var/log/osm.log > ------------------------------------------------- > *** glibc detected *** realloc(): invalid next size: 0x0000000010085e90 > *** > Aborted Could you run opensm under gdb and see backtrace after failure? You may want to rebuild opensm with debug flag, for this in osm directory run: CFLAGS=-g ./configure && make Sasha. 
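
For background on the glibc message: "realloc(): invalid next size" is printed when glibc's heap consistency checks find damaged allocator metadata next to the block being resized, which typically happens after an earlier write ran past the end of a heap buffer. The thread does not identify the actual bug in OpenSM; the following is only an illustrative sketch, with hypothetical names (struct byte_buf, byte_buf_append), of the grow-before-write pattern that avoids this class of error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A growable byte buffer; all names here are illustrative assumptions. */
struct byte_buf {
    unsigned char *data; /* heap storage, NULL while empty */
    size_t len;          /* bytes currently in use */
    size_t cap;          /* bytes currently allocated */
};

/*
 * Append n bytes from src. The buffer is grown BEFORE the copy, so the
 * memcpy never writes past the end of the allocation.
 * Returns 0 on success, -1 on overflow or allocation failure (in which
 * case the buffer is left unchanged).
 */
int byte_buf_append(struct byte_buf *b, const void *src, size_t n)
{
    size_t need;

    assert(b != NULL);
    if (n == 0)
        return 0;
    if (src == NULL || n > SIZE_MAX - b->len)
        return -1;

    need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        unsigned char *p;

        /* Double the capacity until it fits, guarding against wrap-around. */
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) {
                new_cap = need;
                break;
            }
            new_cap *= 2;
        }
        p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }

    memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

A debug build run under gdb, as suggested above, remains the way to find where the real overwrite happens; the backtrace at the abort shows only where the damage was detected, not necessarily where it was caused.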
From halr at voltaire.com Wed Feb 15 10:41:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 13:41:49 -0500 Subject: [openib-general] OpenSM realloc error In-Reply-To: <20060215184711.GE12172@sashak.voltaire.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> Message-ID: <1140028908.4333.22226.camel@hal.voltaire.com> On Wed, 2006-02-15 at 13:47, Sasha Khapyorsky wrote: > On 09:41 Wed 15 Feb , Owen Stampflee wrote: > > This doesnt help much... at all... no new info to report. > > > > [root at m1 ~]# rm /var/log/osm.log > > [root at m1 ~]# opensm -v > > ------------------------------------------------- > > OpenSM Rev:openib-1.1.0 > > Command Line Arguments: > > Verbose option -v (log flags = 0x7) > > Log File: /var/log/osm.log > > ------------------------------------------------- > > *** glibc detected *** realloc(): invalid next size: 0x0000000010085e90 > > *** > > Aborted > > Could you run opensm under gdb and see backtrace after failure? > You may want to rebuild opensm with debug flag, for this in osm > directory run: CFLAGS=-g ./configure && make ./configure --enable-debug is another way. > > Sasha. From swise at opengridcomputing.com Wed Feb 15 11:00:57 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 13:00:57 -0600 Subject: [openib-general] ibv_cmd_create_qp() question Message-ID: <1140030057.21595.34.camel@stevo-desktop> I have a provider in the works that needs provider-specific information passed from the kernel create_qp verb back to the provider library. 
It seems that ibv_cmd_create_qp() doesn't allow for provider-specific data to be passed out of the kernel back to the provider lib. ibv_cmd_create_cq() does support this. I can come up with a patch to the code to support this, but I wanted to query the group to make sure I'm not missing something. Is it a reasonable extension to add this to create_qp()? Thanks, Steve. From rdreier at cisco.com Wed Feb 15 13:13:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:13:14 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <1140030057.21595.34.camel@stevo-desktop> (Steve Wise's message of "Wed, 15 Feb 2006 13:00:57 -0600") References: <1140030057.21595.34.camel@stevo-desktop> Message-ID: Steve> I have a provider in the works that needs provider-specific Steve> information passed from the kernel create_qp verb back to Steve> the provider library. It seems that ibv_cmd_create_qp() Steve> doesn't allow for provider-specific data to be passed out Steve> of the kernel back to the provider lib. Steve> ibv_cmd_create_cq() does support this. Steve> I can come up with a patch to the code to support this, but Steve> I wanted to query the group to make sure I'm not missing Steve> something. Steve> Is it a reasonable extension to add this to create_qp()? Yes, but you have perfect timing: I just made a 1.0-rc6 release last night (I was just writing the announcement now) with the idea that all the APIs and ABIs were frozen. This change would of course break that freeze. However I would be inclined to accept your patch if you can code it up by the end of the week, and also promise me that you thought hard about your driver and that you won't need any more incompatible changes. That way I can do a 1.0-rc7 release early next week with just this change, and not slip my 1.0 final release very much at all. If that schedule is too tight for you, then I think this will probably need to wait for 1.1.0, which I would expect in six months or so. 
- R. From rdreier at cisco.com Wed Feb 15 13:18:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:18:43 -0800 Subject: [openib-general] Re: PATCH] mthca - command interface - revised In-Reply-To: <1140010293.4601.8.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Wed, 15 Feb 2006 15:31:33 +0200") References: <1140010293.4601.8.camel@mtls03.yok.mtl.com> Message-ID: Eli> Roland, this patch is modified according to your comments. It Eli> also adds a kernel configuration option which selects whether Eli> to use posting commands through doorbells. The option is off Eli> by default. Why make a config option? Is there any reason why someone would want to disable this feature, assuming the firmware supports it? If there is some reason to turn this off, then a module parameter would make more sense to me -- otherwise a kernel config option means that a recompile is necessary to change the setting, and distros won't ship drivers with the option turned on. - R. From info at schihei.de Wed Feb 15 13:18:43 2006 From: info at schihei.de (Heiko J Schick) Date: Wed, 15 Feb 2006 22:18:43 +0100 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <1140030057.21595.34.camel@stevo-desktop> References: <1140030057.21595.34.camel@stevo-desktop> Message-ID: <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Hello Steven, we had this problem, too. For ibv_cmd_create_qp we pass the address of the response block to the kernel via our cmd struct. The kernel can then copy all values into the response block (from kernel space to user space). The provider library can then read the provider-specific information passed from the kernel. See ehcau_create_qp in ehca_umain.c for more information. Regards, Heiko On Feb 15, 2006, at 8:00 PM, Steve Wise wrote: > I have a provider in the works that needs provider-specific > information > passed from the kernel create_qp verb back to the provider > library.
It > seems that ibv_cmd_create_qp() doesn't allow for provider-specific > data > to be passed out of the kernel back to the provider lib. > > ibv_cmd_create_cq() does support this. > > I can come up with a patch to the code to support this, but I > wanted to > query the group to make sure I'm not missing something. > > Is it a reasonable extension to add this to create_qp()? > > Thanks, > > Steve. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general > From mst at mellanox.co.il Wed Feb 15 13:24:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 23:24:03 +0200 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <1139015153.475.20.camel@brick.internal.keyresearch.com> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> Message-ID: <20060215212403.GB14424@mellanox.co.il> Quoting r. Ralph Campbell : > Subject: [PATCH] change Mellanox SDP workaround to a moduleparameter > > This patch changes the hardwired MTU limit of 1024 in SDP > into a module parameter so it can be disabled for HCAs > without the RC performance problem. > > Signed-off-by: Ralph Campbell I thought about it some more: what happens if nodes with different max MTU values try to connect? My understanding of the spec: - passive side should send REJ indicating error and passing the maximum MTU it supports - active side should retry the connection with a lower MTU Right? By the way, this handling seems to be missing in CMA. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From info at schihei.de Wed Feb 15 13:25:41 2006 From: info at schihei.de (Heiko J Schick) Date: Wed, 15 Feb 2006 22:25:41 +0100 Subject: [openib-general] SDP and Linux 2.6.16-rc2 Message-ID: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> Hello, I found that Linux 2.6.16-rc2 doesn't export two symbols which are needed by SDP. These symbols are: - dev_ioctl - ip_dev_find Will these symbols be exported again in Linux 2.6.16? Without them it is not possible to load ib_sdp. Regards, Heiko From mst at mellanox.co.il Wed Feb 15 13:27:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 23:27:48 +0200 Subject: [openib-general] Re: PATCH] mthca - command interface - revised In-Reply-To: References: <1140010293.4601.8.camel@mtls03.yok.mtl.com> Message-ID: <20060215212748.GC14424@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: PATCH] mthca - command interface - revised > > Eli> Roland, this patch is modified according to your comments. It > Eli> also adds a kernel configuration option which selects whether > Eli> to use posting commands through doorbells. The option is off > Eli> by default. > > Why make a config option? Is there any reason why someone would want > to disable this feature, assuming the firmware supports it? AFAIK, which of the two options gives better performance might depend on the application and the specific system. For now, Eli made the simpler option the default. > If there is some reason to turn this off, then a module parameter > would make more sense to me -- otherwise a kernel config option means > that a recompile is necessary to change the setting, and distros won't > ship drivers with the option turned on. > > - R. Good idea. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Wed Feb 15 13:32:09 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 15 Feb 2006 23:32:09 +0200 Subject: [openib-general] Re: SDP and Linux 2.6.16-rc2 In-Reply-To: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> References: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> Message-ID: <20060215213209.GD14424@mellanox.co.il> Quoting r. Heiko J Schick : > Subject: SDP and Linux 2.6.16-rc2 > > Hello, > > I figured out that Linux 2.6.16-rc2 don't export two symbols which are > needed by SDP. > > These symbols are: > - dev_ioctl > - ip_dev_find > > Will be this symbols exported in Linux 2.6.16 again, because without > is not possible to load ib_sdp. > > Regards, > Heiko For ip_dev_find, you can apply this patch: https://openib.org/svn/trunk/contrib/mellanox/gen2/patches/sdp_ip_dev_find.patch I'll have to look into dev_ioctl usage; for now I think you can just remove the line that uses it in sdp_inet.c, it's not too important. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Wed Feb 15 13:30:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:30:52 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> (Heiko J. Schick's message of "Wed, 15 Feb 2006 22:18:43 +0100") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: Heiko> Hello Steven, we had this problem, too. For Heiko> ibv_cmd_create_qp we pass via our cmd struct the address of Heiko> the response block to the kernel.The kernel can copies then Heiko> all values into the response block (from kernel space to Heiko> user space). Heiko> The provider library can then the provider specific Heiko> information which are passed from the kernel. Oh man, I didn't notice that before. Please don't work around the existing code like that -- just fix the interface to do what you need. I really don't want low-level drivers using extra userspace pointers to stick their extra data into.
Please use the existing ib_copy_to_udata() function to add driver-specific data after the core response instead. You can look at how mthca gives the device-specific CQ number back to userspace for the create CQ operation to see the right way to do this. Making this work for create QP means that ibv_cmd_create_qp() needs to handle responses the same way ibv_cmd_create_cq() does, which is a driver API change. But given that ehca needs it too, I'm now convinced that the change that Steve wants for ibv_cmd_create_qp() is necessary in libibverbs 1.0. - R. From swise at opengridcomputing.com Wed Feb 15 13:31:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 15:31:05 -0600 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: <1140039065.21595.82.camel@stevo-desktop> Hey Roland, Why don't I use the same scheme as Heiko for now? Then I won't hold up the release. At this point, I can't say for sure what else we might suggest to change in the core APIs anyway. I'll submit a patch down the road to fix the core to handle this ala create_cq(), but it can wait until 1.1.0. Sound like a plan? Steve. On Wed, 2006-02-15 at 22:18 +0100, Heiko J Schick wrote: > Hello Steven, > > we had this problem, too. For ibv_cmd_create_qp we pass via our cmd > struct > the address of the response block to the kernel.The kernel can copies > then > all values into the response block (from kernel space to user space). > > The provider library can then the provider specific information which > are > passed from the kernel. > > See ehcau_create_qp in ehca_umain.c for more information.
> > Regards, > Heiko > > On Feb 15, 2006, at 8:00 PM, Steve Wise wrote: > > > I have a provider in the works that needs provider-specific > > information > > passed from the kernel create_qp verb back to the provider > > library. It > > seems that ibv_cmd_create_qp() doesn't allow for provider-specific > > data > > to be passed out of the kernel back to the provider lib. > > > > ibv_cmd_create_cq() does support this. > > > > I can come up with a patch to the code to support this, but I > > wanted to > > query the group to make sure I'm not missing something. > > > > Is it a reasonable extension to add this to create_qp()? > > > > Thanks, > > > > Steve. > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > > openib-general > > From rdreier at cisco.com Wed Feb 15 13:34:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:34:00 -0800 Subject: [openib-general] Re: PATCH] mthca - command interface - revised In-Reply-To: <20060215212748.GC14424@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Feb 2006 23:27:48 +0200") References: <1140010293.4601.8.camel@mtls03.yok.mtl.com> <20060215212748.GC14424@mellanox.co.il> Message-ID: Michael> AFAIK, which of the two options gives better performance Michael> might depend on the application and the specific system. Michael> For now, Eli made the simpler option the default. Have you seen cases where using the HCR is faster? It seems that in both cases we are doing posted writes to PCI memory, except that the HCR case has to do at least one (slow) read to check the go bit. The doorbell case does use more write barriers since all the writes have to be ordered, but I have a hard time believing that the write barriers are anywhere near as expensive as the read of the go bit. - R. 
From mst at mellanox.co.il Wed Feb 15 13:42:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 23:42:19 +0200 Subject: [openib-general] Re: PATCH] mthca - command interface - revised In-Reply-To: References: <1140010293.4601.8.camel@mtls03.yok.mtl.com> <20060215212748.GC14424@mellanox.co.il> Message-ID: <20060215214219.GE14424@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: PATCH] mthca - command interface - revised > > Michael> AFAIK, which of the two options gives better performance > Michael> might depend on the application and the specific system. > Michael> For now, Eli made the simpler option the default. > > Have you seen cases where using the HCR is faster? It seems that in > both cases we are doing posted writes to PCI memory, except that the > HCR case has to do at least one (slow) read to check the go bit. The > doorbell case does use more write barriers since all the writes have > to be ordered, but I have a hard time believing that the write > barriers are anywhere near as expensive as the read of the go bit. We have not yet figured out why. Possibly some systems slow down if you do a lot of PIO writes in a burst. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From swise at opengridcomputing.com Wed Feb 15 13:43:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 15:43:19 -0600 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: <1140039799.21595.87.camel@stevo-desktop> Just curious, why don't all the verbs have this support? Maybe we should align all the verbs to support commands and responses that allow provider-specific extensions? Stevo. On Wed, 2006-02-15 at 13:30 -0800, Roland Dreier wrote: > Heiko> Hello Steven, we had this problem, too.
For > Heiko> ibv_cmd_create_qp we pass via our cmd struct the address of > Heiko> the response block to the kernel.The kernel can copies then > Heiko> all values into the response block (from kernel space to > Heiko> user space). > > Heiko> The provider library can then the provider specific > Heiko> information which are passed from the kernel. > > Oh man, I didn't notice that before. Please don't work around the > existing code like that -- just fix the interface to do what you need. > I really don't want low-level drivers using extra userspace pointers > to stick their extra data into. > > Please use the existing ib_copy_to_udata() function to add > driver-specific after the core response instead. You can look at how > mthca gives the device-specific CQ number back to userspace for the > create CQ operation to see the right way to do this. > > Making this work for create QP means that ibv_cmd_create_qp() needs to > handle responses the same what ibv_cmd_create_cq() does, which is a > driver API change. But given that ehca needs it too, I'm now > convinced that the change that Steve wants for ibv_cmd_create_qp() is > necessary in libibverbs 1.0. > > - R. From mst at mellanox.co.il Wed Feb 15 13:48:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Feb 2006 23:48:37 +0200 Subject: [openib-general] Re: IPoIB and lid change In-Reply-To: References: Message-ID: <20060215214837.GA15037@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: IPoIB and lid change > > Michael> One simple way to address this would be to have a list of > Michael> all address handles per net device and kill them on an SM > Michael> change event. > > Seems reasonable. It seems a little painful to implement at a first > glance but I might be looking at it wrong. 
Here's my plan (might not have the time to look at it before Sunday which is why I'm not just doing it and sending a patch): - start with ipoib_all_neigh_issues_2.patch - make the neigh list per-device - now in the destructor we'll walk the list of devices rather than the global neigh list - walk all neighbours and kill lids on client reregistration event We now get the destructor issue fixed in a slightly less ugly way, code will be ready for when destructor moves into the params structure, and the LID change will get fixed for simplest/common case. Does this look like a good plan? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From swise at opengridcomputing.com Wed Feb 15 13:53:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 15:53:02 -0600 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: <1140040382.21595.89.camel@stevo-desktop> > Making this work for create QP means that ibv_cmd_create_qp() needs to > handle responses the same what ibv_cmd_create_cq() does, which is a > driver API change. But given that ehca needs it too, I'm now > convinced that the change that Steve wants for ibv_cmd_create_qp() is > necessary in libibverbs 1.0. > > - R. Stay tuned for a patch. Should we patch all the lib code that will need to change too? Or just the core kernel and libibverbs code. Stevo. 
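A minimal sketch of the layout discussed above, in which the core create-QP response comes first in the user buffer and the driver-specific data follows it. Everything named here (core_create_qp_resp, prov_create_qp_resp, prov_private, fake_kernel_create_qp) is a hypothetical stand-in, not the libibverbs or kernel definitions, and the fake kernel function only imitates, under that assumption, what a provider might do with ib_copy_to_udata().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the core create-QP response. */
struct core_create_qp_resp {
	uint32_t qp_handle;
	uint32_t qpn;
};

/* Hypothetical driver-private data that the kernel driver wants to return. */
struct prov_private {
	uint32_t token;
	uint32_t queue_offset;
};

/* Hypothetical provider response: core part first, private fields after it. */
struct prov_create_qp_resp {
	struct core_create_qp_resp core;
	uint32_t token;
	uint32_t queue_offset;
};

/*
 * Imitation of the kernel side (an assumption, not real uverbs code):
 * write the core response at the start of the user buffer, then append
 * the driver data right after it, refusing to overrun resp_size.
 */
static int fake_kernel_create_qp(void *resp, size_t resp_size,
				 const void *drv, size_t drv_size)
{
	struct core_create_qp_resp core = { .qp_handle = 7, .qpn = 0x404 };

	if (resp_size < sizeof core)
		return -1;
	memcpy(resp, &core, sizeof core);

	if (drv_size > resp_size - sizeof core)
		return -1;	/* caller passed a core-only buffer */
	memcpy((char *) resp + sizeof core, drv, drv_size);
	return 0;
}

/* Provider library side: pass one buffer that has room for both parts. */
static int prov_create_qp(struct prov_create_qp_resp *out)
{
	struct prov_private priv = { .token = 0xabcd, .queue_offset = 4096 };

	/* The private fields must start exactly where the core part ends. */
	assert(offsetof(struct prov_create_qp_resp, token) ==
	       sizeof(struct core_create_qp_resp));

	return fake_kernel_create_qp(out, sizeof *out, &priv, sizeof priv);
}
```

The point of the single buffer is that the provider never has to hand the kernel a second userspace pointer; the response size alone tells the kernel how much driver data fits.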
From rdreier at cisco.com Wed Feb 15 13:56:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:56:44 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <1140039065.21595.82.camel@stevo-desktop> (Steve Wise's message of "Wed, 15 Feb 2006 15:31:05 -0600") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> <1140039065.21595.82.camel@stevo-desktop> Message-ID: Steve> Hey Roland, Why don't I use the same scheme as Heiko for Steve> now. Then I won't hold up the release. At this point, I Steve> can't say for sure what else we might suggest to change in Steve> the core APIs anyway. I'll submit a patch down the road to Steve> fix the core to handle this ala create_cq(), but it can Steve> wait until 1.1.0. Actually I'd rather get this straightened out sooner rather than later. I don't want ehca merged with the scheme it uses, so I think libibverbs needs to be ready to do it properly too. It's pretty trivial to update ibv_cmd_create_qp(); in fact I just did it. How does this patch look to you? (driver library changes coming shortly) - R. 
Index: libibverbs/include/infiniband/driver.h =================================================================== --- libibverbs/include/infiniband/driver.h (revision 5421) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -114,7 +114,8 @@ int ibv_cmd_destroy_srq(struct ibv_srq * int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size); + struct ibv_create_qp *cmd, size_t cmd_size, + struct ibv_create_qp_resp *resp, size_t resp_size); int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *qp_init_attr, Index: libibverbs/ChangeLog =================================================================== --- libibverbs/ChangeLog (revision 5421) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,11 @@ +2006-02-15 Roland Dreier + + * src/cmd.c (ibv_cmd_create_qp): Allow userspace device-specific + driver to pass in a response buffer, so that the low-level driver + in the kernel can pass back device-specific information. This + changes the userspace driver API, since the signature of + ibv_cmd_create_qp() is changed. + 2006-02-14 Roland Dreier * Release version 1.0-rc6. 
Index: libibverbs/src/cmd.c =================================================================== --- libibverbs/src/cmd.c (revision 5421) +++ libibverbs/src/cmd.c (working copy) @@ -543,17 +543,11 @@ int ibv_cmd_destroy_srq(struct ibv_srq * int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size) + struct ibv_create_qp *cmd, size_t cmd_size, + struct ibv_create_qp_resp *resp, size_t resp_size) { - union { - struct ibv_create_qp_resp resp; - struct ibv_create_qp_resp_v3 resp_v3; - } r; - - if (abi_ver > 3) - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp, sizeof r.resp); - else - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp_v3, sizeof r.resp_v3); + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); + cmd->user_handle = (uintptr_t) qp; cmd->pd_handle = pd->handle; cmd->send_cq_handle = attr->send_cq->handle; @@ -572,16 +566,22 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, return errno; if (abi_ver > 3) { - qp->handle = r.resp.qp_handle; - qp->qp_num = r.resp.qpn; - attr->cap.max_recv_sge = r.resp.max_recv_sge; - attr->cap.max_send_sge = r.resp.max_send_sge; - attr->cap.max_recv_wr = r.resp.max_recv_wr; - attr->cap.max_send_wr = r.resp.max_send_wr; - attr->cap.max_inline_data = r.resp.max_inline_data; + qp->handle = resp->qp_handle; + qp->qp_num = resp->qpn; + attr->cap.max_recv_sge = resp->max_recv_sge; + attr->cap.max_send_sge = resp->max_send_sge; + attr->cap.max_recv_wr = resp->max_recv_wr; + attr->cap.max_send_wr = resp->max_send_wr; + attr->cap.max_inline_data = resp->max_inline_data; } else { - qp->handle = r.resp_v3.qp_handle; - qp->qp_num = r.resp_v3.qpn; + struct ibv_create_qp_resp_v3 *resp_v3 = + (struct ibv_create_qp_resp_v3 *) resp; + + qp->handle = resp_v3->qp_handle; + qp->qp_num = resp_v3->qpn; + memmove((void *) resp + sizeof *resp, + (void *) resp_v3 + sizeof *resp_v3, + resp_size - sizeof *resp); } return 0; From rdreier at cisco.com Wed Feb 15 
13:58:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 13:58:01 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <1140040382.21595.89.camel@stevo-desktop> (Steve Wise's message of "Wed, 15 Feb 2006 15:53:02 -0600") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> <1140040382.21595.89.camel@stevo-desktop> Message-ID: Steve> Stay tuned for a patch. Should we patch all the lib code Steve> that will need to change too? Or just the core kernel and Steve> libibverbs code. I just sent a patch to libibverbs. I'll patch the library code that needs to change too. Does the core kernel need to change? It seems that a provider can use ib_copy_to_udata() in its create_qp method the same way it can in create_cq. But of course I haven't tried it. - R. From rdreier at cisco.com Wed Feb 15 14:00:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 14:00:10 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <1140039799.21595.87.camel@stevo-desktop> (Steve Wise's message of "Wed, 15 Feb 2006 15:43:19 -0600") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> <1140039799.21595.87.camel@stevo-desktop> Message-ID: Steve> Just curious, why don't all the verbs have this support? Steve> Maybe we should align all the verbs to support commands and Steve> responses that allow provider-specific extensions? It seemed like over-engineering to me. I don't think we need arbitrary data in every single verb. I could easily be wrong but I think we'll very quickly converge on a set of verbs that covers all HW. - R. 
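A rough sketch of the idea behind the ABI version 3 branch in the libibverbs patch posted earlier in this thread: an older kernel writes a shorter core response, so any driver data lands earlier in the buffer and has to be shifted to where the provider expects it. The struct layouts and helper names below (resp_old, resp_new, prov_resp, fake_old_kernel_write, shift_driver_data) are assumptions made for illustration and are not the real ibv_create_qp_resp definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical short response written by an old kernel. */
struct resp_old {
	uint32_t qp_handle;
	uint32_t qpn;
};

/* Hypothetical current response; resp_old is a prefix of it. */
struct resp_new {
	uint32_t qp_handle;
	uint32_t qpn;
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
};

/* Hypothetical provider response: current core part plus driver words. */
struct prov_resp {
	struct resp_new core;
	uint32_t drv[2];
};

/* Imitate an old kernel: short core response, driver data right after it. */
static void fake_old_kernel_write(void *buf, const struct resp_old *old,
				  const void *drv, size_t drv_size)
{
	memcpy(buf, old, sizeof *old);
	memcpy((char *) buf + sizeof *old, drv, drv_size);
}

/*
 * Move the driver data from just after the old-size response to just
 * after the new-size response. memmove() is used because the source
 * and destination ranges may overlap.
 */
static void shift_driver_data(void *resp, size_t resp_size)
{
	assert(resp_size >= sizeof(struct resp_new));
	memmove((char *) resp + sizeof(struct resp_new),
		(char *) resp + sizeof(struct resp_old),
		resp_size - sizeof(struct resp_new));
}
```

In this sketch the core fields beyond the old prefix still hold stale bytes after the shift, so a caller would only trust qp_handle and qpn from an old kernel.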
From swise at opengridcomputing.com Wed Feb 15 14:02:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 16:02:55 -0600 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> <1140040382.21595.89.camel@stevo-desktop> Message-ID: <1140040975.21595.93.camel@stevo-desktop> On Wed, 2006-02-15 at 13:58 -0800, Roland Dreier wrote: > Steve> Stay tuned for a patch. Should we patch all the lib code > Steve> that will need to change too? Or just the core kernel and > Steve> libibverbs code. > > I just sent a patch to libibverbs. I'll patch the library code that > needs to change too. Does the core kernel need to change? It seems > that a provider can use ib_copy_to_udata() in its create_qp method the > same way it can in create_cq. But of course I haven't tried it. Ok. I just perused the uverbs core code again and I guess you're right. From swise at opengridcomputing.com Wed Feb 15 14:03:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Feb 2006 16:03:31 -0600 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> <1140039799.21595.87.camel@stevo-desktop> Message-ID: <1140041011.21595.95.camel@stevo-desktop> On Wed, 2006-02-15 at 14:00 -0800, Roland Dreier wrote: > Steve> Just curious, why don't all the verbs have this support? > > Steve> Maybe we should align all the verbs to support commands and > Steve> responses that allow provider-specific extensions? > > It seemed like over-engineering to me. I don't think we need > arbitrary data in every single verb. I could easily be wrong but I > think we'll very quickly converge on a set of verbs that covers all HW. > > - R. ok. 
From caitlinb at broadcom.com Wed Feb 15 14:22:17 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 15 Feb 2006 14:22:17 -0800 Subject: [openib-general] ibv_cmd_create_qp() question Message-ID: <54AD0F12E08D1541B826BE97C98F99F1296BEA@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve> Just curious, why don't all the verbs have this support? > > Steve> Maybe we should align all the verbs to support commands and > Steve> responses that allow provider-specific extensions? > > It seemed like over-engineering to me. I don't think we need > arbitrary data in every single verb. I could easily be wrong > but I think we'll very quickly converge on a set of verbs > that covers all HW. > Allowing provider-specific data to be established when an object is *created* would definitely be adequate. That allows sharing info about provider-specific data structures. Once that information is shared, however, there should be no need to have private exchanges on each and every verb. The modifications being requested in a qp_modify are *not* provider-specific, merely the implementation data structures that the modifications will be made upon. But since the user-provider-verbs and kernel-provider-verbs already share that information there is nothing provider-specific that has to be communicated with a modify, query or delete verb. So the relevant set is create verbs for objects that are accessed on the fast path (and hence might have user-space created data structures): QP, SRQ and CQ. From rdreier at cisco.com Wed Feb 15 14:43:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 14:43:04 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> (Heiko J. 
Schick's message of "Wed, 15 Feb 2006 22:18:43 +0100") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: OK, I just committed the following patch, which adds the resp/resp_size parameters to ibv_cmd_create_qp(). I will follow up with a patch to ehca to change the kernel/userspace create QP interface to use ib_copy_to_udata() instead of the device-specific respbuf hack that you have now. - R. Index: libibverbs/include/infiniband/driver.h =================================================================== --- libibverbs/include/infiniband/driver.h (revision 5421) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -114,7 +114,8 @@ int ibv_cmd_destroy_srq(struct ibv_srq * int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size); + struct ibv_create_qp *cmd, size_t cmd_size, + struct ibv_create_qp_resp *resp, size_t resp_size); int ibv_cmd_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *qp_attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *qp_init_attr, Index: libibverbs/ChangeLog =================================================================== --- libibverbs/ChangeLog (revision 5421) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,11 @@ +2006-02-15 Roland Dreier + + * src/cmd.c (ibv_cmd_create_qp): Allow userspace device-specific + driver to pass in a response buffer, so that the low-level driver + in the kernel can pass back device-specific information. This + changes the userspace driver API, since the signature of + ibv_cmd_create_qp() is changed. + 2006-02-14 Roland Dreier * Release version 1.0-rc6.
Index: libibverbs/src/cmd.c =================================================================== --- libibverbs/src/cmd.c (revision 5421) +++ libibverbs/src/cmd.c (working copy) @@ -543,17 +543,11 @@ int ibv_cmd_destroy_srq(struct ibv_srq * int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, - struct ibv_create_qp *cmd, size_t cmd_size) + struct ibv_create_qp *cmd, size_t cmd_size, + struct ibv_create_qp_resp *resp, size_t resp_size) { - union { - struct ibv_create_qp_resp resp; - struct ibv_create_qp_resp_v3 resp_v3; - } r; - - if (abi_ver > 3) - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp, sizeof r.resp); - else - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp_v3, sizeof r.resp_v3); + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, resp, resp_size); + cmd->user_handle = (uintptr_t) qp; cmd->pd_handle = pd->handle; cmd->send_cq_handle = attr->send_cq->handle; @@ -572,16 +566,22 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, return errno; if (abi_ver > 3) { - qp->handle = r.resp.qp_handle; - qp->qp_num = r.resp.qpn; - attr->cap.max_recv_sge = r.resp.max_recv_sge; - attr->cap.max_send_sge = r.resp.max_send_sge; - attr->cap.max_recv_wr = r.resp.max_recv_wr; - attr->cap.max_send_wr = r.resp.max_send_wr; - attr->cap.max_inline_data = r.resp.max_inline_data; + qp->handle = resp->qp_handle; + qp->qp_num = resp->qpn; + attr->cap.max_recv_sge = resp->max_recv_sge; + attr->cap.max_send_sge = resp->max_send_sge; + attr->cap.max_recv_wr = resp->max_recv_wr; + attr->cap.max_send_wr = resp->max_send_wr; + attr->cap.max_inline_data = resp->max_inline_data; } else { - qp->handle = r.resp_v3.qp_handle; - qp->qp_num = r.resp_v3.qpn; + struct ibv_create_qp_resp_v3 *resp_v3 = + (struct ibv_create_qp_resp_v3 *) resp; + + qp->handle = resp_v3->qp_handle; + qp->qp_num = resp_v3->qpn; + memmove((void *) resp + sizeof *resp, + (void *) resp_v3 + sizeof *resp_v3, + resp_size - sizeof *resp); } return 0; Index: libmthca/src/verbs.c 
=================================================================== --- libmthca/src/verbs.c (revision 5421) +++ libmthca/src/verbs.c (working copy) @@ -470,9 +470,10 @@ int mthca_destroy_srq(struct ibv_srq *sr struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) { - struct mthca_create_qp cmd; - struct mthca_qp *qp; - int ret; + struct mthca_create_qp cmd; + struct ibv_create_qp_resp resp; + struct mthca_qp *qp; + int ret; /* Sanity check QP size before proceeding */ if (attr->cap.max_send_wr > 65536 || @@ -525,7 +526,8 @@ struct ibv_qp *mthca_create_qp(struct ib cmd.lkey = qp->mr->lkey; - ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd); + ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, + &resp, sizeof resp); if (ret) goto err_rq_db; Index: libmthca/ChangeLog =================================================================== --- libmthca/ChangeLog (revision 5421) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2006-02-15 Roland Dreier + + * src/verbs.c (mthca_create_qp): Update to add new response and + response size parameters for libibverbs ibv_cmd_create_qp(). + 2006-02-14 Roland Dreier * Release version 1.0-rc6. 
Index: libipathverbs/src/verbs.c =================================================================== --- libipathverbs/src/verbs.c (revision 5421) +++ libipathverbs/src/verbs.c (working copy) @@ -175,15 +175,16 @@ int ipath_destroy_cq(struct ibv_cq *cq) struct ibv_qp *ipath_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) { - struct ibv_create_qp cmd; - struct ibv_qp *qp; - int ret; + struct ibv_create_qp cmd; + struct ibv_create_qp_resp resp; + struct ibv_qp *qp; + int ret; qp = malloc(sizeof *qp); if (!qp) return NULL; - ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd); + ret = ibv_cmd_create_qp(pd, qp, attr, &cmd, sizeof cmd, &resp, sizeof resp); if (ret) { free(qp); return NULL; Index: libehca/src/ehca_umain.c =================================================================== --- libehca/src/ehca_umain.c (revision 5421) +++ libehca/src/ehca_umain.c (working copy) @@ -283,7 +283,8 @@ struct ibv_qp *ehcau_create_qp(struct ib int ret = 0; struct ehcau_qp *my_qp = NULL; struct ehcau_create_qp cmd; - struct ehcau_create_qp_resp resp; + struct ehcau_create_qp_resp ehca_resp; + struct ibv_create_qp_resp resp; struct ibv_context *context = NULL; int ret2 = 0; @@ -300,8 +301,8 @@ struct ibv_qp *ehcau_create_qp(struct ib memset(my_qp, 0, sizeof(*my_qp)); memset(&cmd, 0, sizeof(cmd)); - memset(&resp, 0, sizeof(resp)); - cmd.respbuf = (uintptr_t) & resp; /* TODO: better include resp in ibv_cmd_create_qp call */ + memset(&ehca_resp, 0, sizeof(ehca_resp)); + cmd.respbuf = (uintptr_t) &ehca_resp; /* TODO: better include resp in ibv_cmd_create_qp call */ if (pthread_spin_init(&my_qp->spinlock_s, PTHREAD_PROCESS_PRIVATE) || pthread_spin_init(&my_qp->spinlock_r, PTHREAD_PROCESS_PRIVATE)) { @@ -309,7 +310,8 @@ struct ibv_qp *ehcau_create_qp(struct ib } ret = ibv_cmd_create_qp(pd, &my_qp->ib_qp, attr, - &cmd.ibv_cmd, sizeof(cmd)); + &cmd.ibv_cmd, sizeof(cmd), + &resp, sizeof resp); if (ret != 0) { EDEB_ERR(4, "ibv_cmd_create_qp() failed ret=%x pd=%p", @@ -317,9 
+319,9 @@ struct ibv_qp *ehcau_create_qp(struct ib goto create_qp_exit0; } /* copy data returned from kernel */ - my_qp->qp_num = resp.qp_num; - my_qp->token = resp.token; - my_qp->ehca_qp_core = resp.ehca_qp_core; + my_qp->qp_num = ehca_resp.qp_num; + my_qp->token = ehca_resp.token; + my_qp->ehca_qp_core = ehca_resp.ehca_qp_core; my_qp->ehca_qp_core.ipz_rqueue.current_q_addr = my_qp->ehca_qp_core.ipz_rqueue.queue; my_qp->ehca_qp_core.ipz_squeue.current_q_addr = From rdreier at cisco.com Wed Feb 15 15:04:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 15:04:36 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> (Heiko J. Schick's message of "Wed, 15 Feb 2006 22:18:43 +0100") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: As promised, here is a patch that converts ehca to use ib_copy_to_udata() in the create_qp method in exactly the same way it is used in the create_cq method. Please test and commit if it looks OK. (I have an eHCA test machine that I'll get set up soon so I can test myself, I promise). 
As a bonus, this simplifies your code as well: linux-kernel/infiniband/hw/ehca/ehca_classes.h | 10 +--------- linux-kernel/infiniband/hw/ehca/ehca_qp.c | 15 ++------------- userspace/libehca/src/ehca_uclasses.h | 13 ++----------- userspace/libehca/src/ehca_umain.c | 23 ++++++++++------------- 4 files changed, 15 insertions(+), 46 deletions(-) Thanks, Roland Index: userspace/libehca/src/ehca_uclasses.h =================================================================== --- userspace/libehca/src/ehca_uclasses.h (revision 5421) +++ userspace/libehca/src/ehca_uclasses.h (working copy) @@ -142,13 +142,8 @@ int ehcau_attach_mcast(struct ibv_qp *qp int ehcau_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); /** - * cmd and resp structs to pass to kernel space + * resp structs from kernel space */ -struct ehcau_create_cq { - struct ibv_create_cq ibv_cmd; - u64 respbuf; -}; - struct ehcau_create_cq_resp { struct ibv_create_cq_resp ibv_resp; u32 cq_number; @@ -156,12 +151,8 @@ struct ehcau_create_cq_resp { struct ehca_cq_core ehca_cq_core; }; -struct ehcau_create_qp { - struct ibv_create_qp ibv_cmd; - u64 respbuf; -}; - struct ehcau_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; u32 qp_num; u32 token; struct ehca_qp_core ehca_qp_core; Index: userspace/libehca/src/ehca_umain.c =================================================================== --- userspace/libehca/src/ehca_umain.c (revision 5424) +++ userspace/libehca/src/ehca_umain.c (working copy) @@ -200,7 +200,7 @@ int ehcau_dealloc_pd(struct ibv_pd *pd) struct ibv_cq *ehcau_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector) { - struct ehcau_create_cq cmd; + struct ibv_create_cq cmd; struct ehcau_create_cq_resp resp; struct ehcau_cq *my_cq = NULL; int ret = 0; @@ -220,9 +220,9 @@ struct ibv_cq *ehcau_create_cq(struct ib "context=%p cqe=%x", context, cqe); goto create_cq_exit0; } - cmd.respbuf = (uintptr_t) & resp; + ret = ibv_cmd_create_cq(context, 
cqe, channel, comp_vector, &my_cq->ib_cq, - &cmd.ibv_cmd, sizeof(cmd), + &cmd, sizeof(cmd), &resp.ibv_resp, sizeof(resp)); if (ret) { EDEB_ERR(4, "ibv_cmd_create_cq() failed " @@ -282,9 +282,8 @@ struct ibv_qp *ehcau_create_qp(struct ib { int ret = 0; struct ehcau_qp *my_qp = NULL; - struct ehcau_create_qp cmd; - struct ehcau_create_qp_resp ehca_resp; - struct ibv_create_qp_resp resp; + struct ibv_create_qp cmd; + struct ehcau_create_qp_resp resp; struct ibv_context *context = NULL; int ret2 = 0; @@ -301,8 +300,6 @@ struct ibv_qp *ehcau_create_qp(struct ib memset(my_qp, 0, sizeof(*my_qp)); memset(&cmd, 0, sizeof(cmd)); - memset(&ehca_resp, 0, sizeof(ehca_resp)); - cmd.respbuf = (uintptr_t) &ehca_resp; /* TODO: better include resp in ibv_cmd_create_qp call */ if (pthread_spin_init(&my_qp->spinlock_s, PTHREAD_PROCESS_PRIVATE) || pthread_spin_init(&my_qp->spinlock_r, PTHREAD_PROCESS_PRIVATE)) { @@ -310,8 +307,8 @@ struct ibv_qp *ehcau_create_qp(struct ib } ret = ibv_cmd_create_qp(pd, &my_qp->ib_qp, attr, - &cmd.ibv_cmd, sizeof(cmd), - &resp, sizeof resp); + &cmd, sizeof(cmd), + &resp.ibv_resp, sizeof resp); if (ret != 0) { EDEB_ERR(4, "ibv_cmd_create_qp() failed ret=%x pd=%p", @@ -319,9 +316,9 @@ struct ibv_qp *ehcau_create_qp(struct ib goto create_qp_exit0; } /* copy data returned from kernel */ - my_qp->qp_num = ehca_resp.qp_num; - my_qp->token = ehca_resp.token; - my_qp->ehca_qp_core = ehca_resp.ehca_qp_core; + my_qp->qp_num = resp.qp_num; + my_qp->token = resp.token; + my_qp->ehca_qp_core = resp.ehca_qp_core; my_qp->ehca_qp_core.ipz_rqueue.current_q_addr = my_qp->ehca_qp_core.ipz_rqueue.queue; my_qp->ehca_qp_core.ipz_squeue.current_q_addr = Index: linux-kernel/infiniband/hw/ehca/ehca_classes.h =================================================================== --- linux-kernel/infiniband/hw/ehca/ehca_classes.h (revision 5421) +++ linux-kernel/infiniband/hw/ehca/ehca_classes.h (working copy) @@ -345,22 +345,14 @@ extern struct idr ehca_qp_idr; extern struct idr 
ehca_cq_idr; /* - * cmd and resp structs for comm bw user and kernel space + * resp structs for comm bw user and kernel space */ -struct ehca_create_cq { - u64 respbuf; -}; - struct ehca_create_cq_resp { u32 cq_number; u32 token; struct ehca_cq_core ehca_cq_core; }; -struct ehca_create_qp { - u64 respbuf; -}; - struct ehca_create_qp_resp { u32 qp_num; u32 token; Index: linux-kernel/infiniband/hw/ehca/ehca_qp.c =================================================================== --- linux-kernel/infiniband/hw/ehca/ehca_qp.c (revision 5421) +++ linux-kernel/infiniband/hw/ehca/ehca_qp.c (working copy) @@ -404,7 +404,6 @@ struct ib_qp *ehca_create_qp(struct ib_p struct ehca_cq *recv_ehca_cq = NULL; struct ehca_cq *send_ehca_cq = NULL; struct ib_ucontext *context = NULL; - struct ehca_create_qp ucmd; u64 hipz_rc = H_Parameter; int max_send_sge; int max_recv_sge; @@ -449,13 +448,6 @@ struct ib_qp *ehca_create_qp(struct ib_p if (pd->uobject && udata != NULL) { context = pd->uobject->context; } - if (context != NULL) { - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { - EDEB_ERR(4, "copy_from_user() failed udata=%p ", - udata); - return ERR_PTR(-EFAULT); - } - } my_qp = ehca_qp_new(); if (!my_qp) { @@ -670,11 +662,8 @@ struct ib_qp *ehca_create_qp(struct ib_p &vma); my_qp->uspace_fwh = (u64)resp.ehca_qp_core.galpas.kernel.fw_handle; - - if (copy_to_user - ((u64 __user *) (unsigned long)ucmd.respbuf, &resp, - sizeof(resp))) { - EDEB_ERR(4, "copy_to_user() failed"); + if (ib_copy_to_udata(udata, &resp, sizeof resp) { + EDEB_ERR(4, "Copy to udata failed"); ret = -EINVAL; goto create_qp_exit3; } From iod00d at hp.com Wed Feb 15 15:08:56 2006 From: iod00d at hp.com (Grant Grundler) Date: Wed, 15 Feb 2006 15:08:56 -0800 Subject: [openib-general] Re: PATCH] mthca - command interface - revised In-Reply-To: References: <1140010293.4601.8.camel@mtls03.yok.mtl.com> <20060215212748.GC14424@mellanox.co.il> Message-ID: <20060215230856.GB15878@esmail.cup.hp.com> On Wed, Feb 15, 2006 
at 01:34:00PM -0800, Roland Dreier wrote: > Michael> AFAIK, which of the two options gives better performance > Michael> might depend on the application and the specific system. > Michael> For now, Eli made the simpler option the default. > > Have you seen cases where using the HCR is faster? It seems that in > both cases we are doing posted writes to PCI memory, except that the > HCR case has to do at least one (slow) read to check the go bit. The > doorbell case does use more write barriers since all the writes have > to be ordered, but I have a hard time believing that the write > barriers are anywhere near as expensive as the read of the go bit. me too. AFAIK, the write barriers only guarantee the write has left the CPU, is in flight, and subject to PCI ordering rules. The MMIO read is going to cost 1000-3000 CPU cycles depending on chipset, CPU speed, and which register it's reading from the device. However, that doesn't mean all metrics are better just because the CPU is more efficient. Forcing things down the PCI bus will sometimes improve latency sensitive benchmarks. grant From rdreier at cisco.com Wed Feb 15 15:21:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 15:21:03 -0800 Subject: [openib-general] ibv_cmd_create_qp() question In-Reply-To: (Roland Dreier's message of "Wed, 15 Feb 2006 15:04:36 -0800") References: <1140030057.21595.34.camel@stevo-desktop> <8887D240-23A7-4CA4-848B-C8AB29588F7A@schihei.de> Message-ID: > As promised, here is a patch that converts ehca to use > ib_copy_to_udata() in the create_qp method in exactly the same way it > is used in the create_cq method. > Please test and commit if it looks OK. (I have an eHCA test machine > that I'll get set up soon so I can test myself, I promise). Err... here's a patch that at least compiles.... 
As a bonus it deletes even more code: linux-kernel/infiniband/hw/ehca/ehca_classes.h | 10 +--------- linux-kernel/infiniband/hw/ehca/ehca_cq.c | 8 -------- linux-kernel/infiniband/hw/ehca/ehca_qp.c | 15 ++------------- userspace/libehca/src/ehca_uclasses.h | 13 ++----------- userspace/libehca/src/ehca_umain.c | 23 ++++++++++------------- 5 files changed, 15 insertions(+), 54 deletions(-) sorry for the screw-up with the previous patch. - Roland Signed-off-by: Roland Dreier --- userspace/libehca/src/ehca_uclasses.h (revision 5421) +++ userspace/libehca/src/ehca_uclasses.h (working copy) @@ -142,13 +142,8 @@ int ehcau_attach_mcast(struct ibv_qp *qp int ehcau_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); /** - * cmd and resp structs to pass to kernel space + * resp structs from kernel space */ -struct ehcau_create_cq { - struct ibv_create_cq ibv_cmd; - u64 respbuf; -}; - struct ehcau_create_cq_resp { struct ibv_create_cq_resp ibv_resp; u32 cq_number; @@ -156,12 +151,8 @@ struct ehcau_create_cq_resp { struct ehca_cq_core ehca_cq_core; }; -struct ehcau_create_qp { - struct ibv_create_qp ibv_cmd; - u64 respbuf; -}; - struct ehcau_create_qp_resp { + struct ibv_create_qp_resp ibv_resp; u32 qp_num; u32 token; struct ehca_qp_core ehca_qp_core; --- userspace/libehca/src/ehca_umain.c (revision 5424) +++ userspace/libehca/src/ehca_umain.c (working copy) @@ -200,7 +200,7 @@ int ehcau_dealloc_pd(struct ibv_pd *pd) struct ibv_cq *ehcau_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector) { - struct ehcau_create_cq cmd; + struct ibv_create_cq cmd; struct ehcau_create_cq_resp resp; struct ehcau_cq *my_cq = NULL; int ret = 0; @@ -220,9 +220,9 @@ struct ibv_cq *ehcau_create_cq(struct ib "context=%p cqe=%x", context, cqe); goto create_cq_exit0; } - cmd.respbuf = (uintptr_t) & resp; + ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector, &my_cq->ib_cq, - &cmd.ibv_cmd, sizeof(cmd), + &cmd, sizeof(cmd), 
&resp.ibv_resp, sizeof(resp)); if (ret) { EDEB_ERR(4, "ibv_cmd_create_cq() failed " @@ -282,9 +282,8 @@ struct ibv_qp *ehcau_create_qp(struct ib { int ret = 0; struct ehcau_qp *my_qp = NULL; - struct ehcau_create_qp cmd; - struct ehcau_create_qp_resp ehca_resp; - struct ibv_create_qp_resp resp; + struct ibv_create_qp cmd; + struct ehcau_create_qp_resp resp; struct ibv_context *context = NULL; int ret2 = 0; @@ -301,8 +300,6 @@ struct ibv_qp *ehcau_create_qp(struct ib memset(my_qp, 0, sizeof(*my_qp)); memset(&cmd, 0, sizeof(cmd)); - memset(&ehca_resp, 0, sizeof(ehca_resp)); - cmd.respbuf = (uintptr_t) &ehca_resp; /* TODO: better include resp in ibv_cmd_create_qp call */ if (pthread_spin_init(&my_qp->spinlock_s, PTHREAD_PROCESS_PRIVATE) || pthread_spin_init(&my_qp->spinlock_r, PTHREAD_PROCESS_PRIVATE)) { @@ -310,8 +307,8 @@ struct ibv_qp *ehcau_create_qp(struct ib } ret = ibv_cmd_create_qp(pd, &my_qp->ib_qp, attr, - &cmd.ibv_cmd, sizeof(cmd), - &resp, sizeof resp); + &cmd, sizeof(cmd), + &resp.ibv_resp, sizeof resp); if (ret != 0) { EDEB_ERR(4, "ibv_cmd_create_qp() failed ret=%x pd=%p", @@ -319,9 +316,9 @@ struct ibv_qp *ehcau_create_qp(struct ib goto create_qp_exit0; } /* copy data returned from kernel */ - my_qp->qp_num = ehca_resp.qp_num; - my_qp->token = ehca_resp.token; - my_qp->ehca_qp_core = ehca_resp.ehca_qp_core; + my_qp->qp_num = resp.qp_num; + my_qp->token = resp.token; + my_qp->ehca_qp_core = resp.ehca_qp_core; my_qp->ehca_qp_core.ipz_rqueue.current_q_addr = my_qp->ehca_qp_core.ipz_rqueue.queue; my_qp->ehca_qp_core.ipz_squeue.current_q_addr = --- linux-kernel/infiniband/hw/ehca/ehca_classes.h (revision 5421) +++ linux-kernel/infiniband/hw/ehca/ehca_classes.h (working copy) @@ -345,22 +345,14 @@ extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; /* - * cmd and resp structs for comm bw user and kernel space + * resp structs for comm bw user and kernel space */ -struct ehca_create_cq { - u64 respbuf; -}; - struct ehca_create_cq_resp { u32 
cq_number; u32 token; struct ehca_cq_core ehca_cq_core; }; -struct ehca_create_qp { - u64 respbuf; -}; - struct ehca_create_qp_resp { u32 qp_num; u32 token; --- linux-kernel/infiniband/hw/ehca/ehca_cq.c (revision 5421) +++ linux-kernel/infiniband/hw/ehca/ehca_cq.c (working copy) @@ -132,7 +132,6 @@ struct ib_cq *ehca_create_cq(struct ib_d u64 hipz_rc = H_Success; int ipz_rc = 0; int ret = 0; - struct ehca_create_cq ucmd; const u32 additional_cqe=20; int i= 0; @@ -147,13 +146,6 @@ struct ib_cq *ehca_create_cq(struct ib_d } number_of_entries += additional_cqe; - if (context) { - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { - EDEB_ERR(4, "Copy from udata failed."); - return ERR_PTR(-EFAULT); - } - } - my_cq = ehca_cq_new(); if (my_cq == NULL) { cq = ERR_PTR(-ENOMEM); --- linux-kernel/infiniband/hw/ehca/ehca_qp.c (revision 5421) +++ linux-kernel/infiniband/hw/ehca/ehca_qp.c (working copy) @@ -404,7 +404,6 @@ struct ib_qp *ehca_create_qp(struct ib_p struct ehca_cq *recv_ehca_cq = NULL; struct ehca_cq *send_ehca_cq = NULL; struct ib_ucontext *context = NULL; - struct ehca_create_qp ucmd; u64 hipz_rc = H_Parameter; int max_send_sge; int max_recv_sge; @@ -449,13 +448,6 @@ struct ib_qp *ehca_create_qp(struct ib_p if (pd->uobject && udata != NULL) { context = pd->uobject->context; } - if (context != NULL) { - if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { - EDEB_ERR(4, "copy_from_user() failed udata=%p ", - udata); - return ERR_PTR(-EFAULT); - } - } my_qp = ehca_qp_new(); if (!my_qp) { @@ -670,11 +662,8 @@ struct ib_qp *ehca_create_qp(struct ib_p &vma); my_qp->uspace_fwh = (u64)resp.ehca_qp_core.galpas.kernel.fw_handle; - - if (copy_to_user - ((u64 __user *) (unsigned long)ucmd.respbuf, &resp, - sizeof(resp))) { - EDEB_ERR(4, "copy_to_user() failed"); + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + EDEB_ERR(4, "Copy to udata failed"); ret = -EINVAL; goto create_qp_exit3; } From ralphc at pathscale.com Wed Feb 15 15:54:01 2006 From: ralphc at 
pathscale.com (Ralph Campbell) Date: Wed, 15 Feb 2006 15:54:01 -0800 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <20060215212403.GB14424@mellanox.co.il> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> Message-ID: <1140047642.673.174.camel@brick.internal.keyresearch.com> I'm not all that familiar with the MTU negotiation protocol either. Note that the CM or SDP should be using ib_query_port() to get the max MTU the device/port supports. My main concern is the device specific code which is compiled into sdp_cm_path_complete() to reduce the MTU to something less than the maximum to work around a performance bug. I was proposing to make this a module parameter but perhaps a better solution would be some interface where the device driver can set it or SDP queries the device for the optimum value. On Wed, 2006-02-15 at 23:24 +0200, Michael S. Tsirkin wrote: > Quoting r. Ralph Campbell : > > Subject: [PATCH] change Mellanox SDP workaround to a moduleparameter > > > > This patch changes the hardwired MTU limit of 1024 in SDP > > into a module parameter so it can be disabled for HCAs > > without the RC performance problem. > > > > Signed-off-by: Ralph Campbell > > I thought about it some more: what happens if nodes with different > max MTU values try to connect? > > My understanding of the spec: > > - passive side should send REJ indicating error and passing the > maximum MTU it supports > - active side should retry the connection with a lower MTU > > Right? > By the way, this handling seems to be missing in CMA. 
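The REJ-and-retry flow in the quoted reading of the spec can be modelled as a toy state exchange. This is a sketch under stated assumptions, not ib_cm or CMA code: the struct and function names (rej_msg, passive_handle_req, active_connect, mtu_code_to_bytes) are hypothetical, the REJ is reduced to a reason code plus one ARI byte, and the active side retries exactly once.

```c
#include <assert.h>
#include <stdint.h>

/* IB path MTU encodings: 1 = 256 bytes, 2 = 512, ... 5 = 4096. */
enum mtu_code { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* CM REJ reason for "Invalid Path MTU". */
#define REJ_INVALID_PATH_MTU 26

/* Hypothetical, heavily reduced REJ message: reason plus one byte of ARI. */
struct rej_msg {
    int reason;
    uint8_t ari_mtu; /* largest MTU code the passive side will accept */
};

/* Convert an MTU code to its size in bytes. */
static int mtu_code_to_bytes(enum mtu_code code)
{
    return 128 << code;
}

/* Passive side: accept (0) or fill in a REJ and return -1. */
static int passive_handle_req(enum mtu_code requested,
                              enum mtu_code passive_max,
                              struct rej_msg *rej)
{
    if (requested > passive_max) {
        rej->reason = REJ_INVALID_PATH_MTU;
        rej->ari_mtu = (uint8_t) passive_max;
        return -1;
    }
    return 0;
}

/*
 * Active side: try the wanted MTU; on REJ 26, retry once with the MTU from
 * the ARI. Returns the negotiated MTU code, or 0 if no connection was made.
 */
static int active_connect(enum mtu_code wanted, enum mtu_code passive_max)
{
    struct rej_msg rej = { 0, 0 };

    if (passive_handle_req(wanted, passive_max, &rej) == 0)
        return wanted;

    /* Only retry for the MTU reason, and only with a strictly lower MTU. */
    if (rej.reason != REJ_INVALID_PATH_MTU ||
        rej.ari_mtu < MTU_256 || rej.ari_mtu >= wanted)
        return 0;

    if (passive_handle_req((enum mtu_code) rej.ari_mtu, passive_max, &rej) == 0)
        return rej.ari_mtu;

    return 0;
}
```

The passive side here could equally be a node that merely prefers a smaller MTU, such as the 1K workaround case discussed in this thread; the model does not distinguish "cannot" from "prefers not to".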
> -- Ralph Campbell From bos at pathscale.com Wed Feb 15 16:01:27 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 15 Feb 2006 16:01:27 -0800 Subject: [openib-general] SDP and Linux 2.6.16-rc2 In-Reply-To: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> References: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> Message-ID: <1140048087.19138.5.camel@serpentine.pathscale.com> On Wed, 2006-02-15 at 22:25 +0100, Heiko J Schick wrote: > These symbols are: > - dev_ioctl > - ip_dev_find > > Will be this symbols exported in Linux 2.6.16 again, because without > is not possible to load ib_sdp. I don't know about dev_ioctl, but ip_dev_find isn't used by any modules in the mainline kernel tree, which is why it's not exported. If and when some OpenIB component that's part of the mainline tree uses it, it will get re-exported. (Ralph Campbell's message of "Wed, 15 Feb 2006 15:54:01 -0800") References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> <1140047642.673.174.camel@brick.internal.keyresearch.com> Message-ID: mst> I thought about it some more: what happens if nodes with mst> different max MTU values try to connect? This should never happen. The SM should never give a path with an MTU that is not supported by both end nodes. I guess the question is what to do when a Tavor (with the performance bug that makes a 1K MTU faster) connects to someone else. - R. From jgunthorpe at obsidianresearch.com Wed Feb 15 16:45:43 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 15 Feb 2006 17:45:43 -0700 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: References: Message-ID: <20060216004543.GA2135@obsidianresearch.com> On Wed, Feb 15, 2006 at 02:12:10PM +1300, Dave Watkins wrote: > I've tried a few Asus boards with IB cards in their graphics card slots > with no luck, the boards did post but the cards weren't visable to the > drivers. 
Same cards in other machines were fine. I just ran out and bought an Asus A8N-VM (~80$ CDN) and so far it seems to work (0702 BIOS), at least the driver loads and lspci -vv reports an x8 PCI-E link with the card in the x16 graphics slot. So that's promising. Unfortunately the board's BIOS's ACPI tables are totally broken and the BIOS assigns every interrupt source in the system to IRQ 5 :< At least it does work in full APIC mode which means it might be possible to get MSI working if the nvidia bridge isn't broken. Using stock Kernel 2.6.15.4, in-kernel mthca driver and a mhgs18-xt HCA. 03:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- > I suspect somewhere in the BIOS or chipset you can actually say that a > certain PCI-E x16 slot is for "graphics" and no other cards are > initialised, but that's only a guess. I've never heard that a chipset has limited support to graphics or IO. My understanding is that any incompatibility is soundly a BIOS problem with enumeration and configuration of the bus segment. I expect with the chipset specs available it would probably be possible to make a linux PCI fixup to make systems that boot but won't enable the device work by directly twiddling the PCI-E bridge in the system. Basically the same way linux currently has fixups for BIOS bugs in supporting PCI-PCI bridges.
Thanks, Jason From rdreier at cisco.com Wed Feb 15 16:52:17 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Feb 2006 16:52:17 -0800 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <20060216004543.GA2135@obsidianresearch.com> (Jason Gunthorpe's message of "Wed, 15 Feb 2006 17:45:43 -0700") References: <20060216004543.GA2135@obsidianresearch.com> Message-ID: Jason> Unfortunately the board's BIOS's ACPI tables are totally Jason> broken and the BIOS assigns every interrupt source in the Jason> system to IRQ 5 :< At least it does work in full APIC mode Jason> which means it might be possible to get MSI working if the Jason> nvidia bridge isn't broken. I've not seen any problems with MSI/MSI-X with nforce4 and PCIe HCAs on both my Asus A8N-SLI and HP DL145G2 systems. Just building a kernel with CONFIG_PCI_MSI enabled might help, since it changes the way the kernel numbers interrupts. - R. From halr at voltaire.com Wed Feb 15 19:56:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 22:56:01 -0500 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <20060215212403.GB14424@mellanox.co.il> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> Message-ID: <1140062156.4333.26103.camel@hal.voltaire.com> On Wed, 2006-02-15 at 16:24, Michael S. Tsirkin wrote: > Quoting r. Ralph Campbell : > > Subject: [PATCH] change Mellanox SDP workaround to a moduleparameter > > > > This patch changes the hardwired MTU limit of 1024 in SDP > > into a module parameter so it can be disabled for HCAs > > without the RC performance problem. > > > > Signed-off-by: Ralph Campbell > > I thought about it some more: what happens if nodes with different > max MTU values try to connect? 
> > My understanding of the spec: > > - passive side should send REJ indicating error Invalid Path MTU (error 26) > and passing the maximum MTU it supports in the ARI > - active side should retry the connection with a lower MTU This part is optional. > Right? > By the way, this handling seems to be missing in CMA. From halr at voltaire.com Wed Feb 15 20:00:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 23:00:18 -0500 Subject: [openib-general] SDP and Linux 2.6.16-rc2 In-Reply-To: <1140048087.19138.5.camel@serpentine.pathscale.com> References: <714B5872-0F6C-4DA3-8542-3EB29709F0F7@schihei.de> <1140048087.19138.5.camel@serpentine.pathscale.com> Message-ID: <1140062220.4333.26113.camel@hal.voltaire.com> On Wed, 2006-02-15 at 19:01, Bryan O'Sullivan wrote: > On Wed, 2006-02-15 at 22:25 +0100, Heiko J Schick wrote: > > > These symbols are: > > - dev_ioctl > > - ip_dev_find > > > > Will be this symbols exported in Linux 2.6.16 again, because without > > is not possible to load ib_sdp. > > I don't know about dev_ioctl, but ip_dev_find isn't used by any modules > in the mainline kernel tree, which is why it's not exported. If and > when some OpenIB component that's part of the mainline tree uses it, it > will get re-exported. addr.c which is part of CMA which was pushed upstream requires ip_dev_find to be exported. 
-- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Feb 15 20:03:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Feb 2006 23:03:33 -0500 Subject: [openib-general] Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> <1140047642.673.174.camel@brick.internal.keyresearch.com> Message-ID: <1140062417.4333.26138.camel@hal.voltaire.com> On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > mst> I thought about it some more: what happens if nodes with > mst> different max MTU values try to connect? > > This should never happen. The SM should never give a path with an MTU > that is not supported by both end nodes. True, but the description for the CM REJ code 26 for Invalid Path MTU states something a little different (p. 667): "The recepient of the REQ message cannot support the maximum packet payload size requested." is being interpreted as "prefers not to support the max payload size requested" which is probably OK. > I guess the question is what to do when a Tavor (with the performance > bug that makes a 1K MTU faster) connects to someone else. Isn't it the other way 'round (when something with a larger MTU connects to Tavor) ? -- Hal > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Wed Feb 15 23:54:18 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 16 Feb 2006 09:54:18 +0200 Subject: [openib-general] Re: Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <1140062417.4333.26138.camel@hal.voltaire.com> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> <1140047642.673.174.camel@brick.internal.keyresearch.com> <1140062417.4333.26138.camel@hal.voltaire.com> Message-ID: <20060216075418.GO12974@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: Re: [PATCH] change Mellanox SDP workaround to a moduleparameter > > On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > > mst> I thought about it some more: what happens if nodes with > > mst> different max MTU values try to connect? > > > > This should never happen. The SM should never give a path with an MTU > > that is not supported by both end nodes. > > True, but the description for the CM REJ code 26 for Invalid Path MTU > states something a little different (p. 667): > > "The recepient of the REQ message cannot support the maximum packet > payload size requested." Right. > is being interpreted as "prefers not to support the max payload size > requested" which is probably OK. > > > I guess the question is what to do when a Tavor (with the performance > > bug that makes a 1K MTU faster) connects to someone else. > > Isn't it the other way 'round (when something with a larger MTU connects > to Tavor) ? Right. I wish we had an MTU field in the REP packet, but we don't. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From eli at mellanox.co.il Thu Feb 16 01:08:31 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 16 Feb 2006 11:08:31 +0200 Subject: [openib-general] Re: PATCH] mthca - command interface - revised Message-ID: <1140080911.4601.52.camel@mtls03.yok.mtl.com> Looks like we all agree that it should be better to post commands through doorbells and save the polling of the go bit.
Then I can change the patch so that there will be a r/w module parameter to mthca, ON by default, to enable this feature and the user can turn off/on this functionality on the fly by changing the value of this parameter. If we agree on this I'll post a new patch early next week Eli -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Feb 16 03:12:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Feb 2006 06:12:41 -0500 Subject: [openib-general] Re: Re: [PATCH] change Mellanox SDP workaround to a moduleparameter In-Reply-To: <20060216075418.GO12974@mellanox.co.il> References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> <1140047642.673.174.camel@brick.internal.keyresearch.com> <1140062417.4333.26138.camel@hal.voltaire.com> <20060216075418.GO12974@mellanox.co.il> Message-ID: <1140088360.4333.29091.camel@hal.voltaire.com> On Thu, 2006-02-16 at 02:54, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: Re: [PATCH] change Mellanox SDP workaround to a moduleparameter > > > > On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > > > mst> I thought about it some more: what happens if nodes with > > > mst> different max MTU values try to connect? > > > > > > This should never happen. The SM should never give a path with an MTU > > > that is not supported by both end nodes. > > > > True, but the description for the CM REJ code 26 for Invalid Path MTU > > states something a little different (p. 667): > > > > "The recepient of the REQ message cannot support the maximum packet > > payload size requested." > > Right. > > > is being interpreted as "prefers not to support the max payload size > > requested" which is probably OK. 
> > > > > I guess the question is what to do when a Tavor (with the performance > > > bug that makes a 1K MTU faster) connects to someone else. > > > > Isn't it the other way 'round (when something with a larger MTU connects > > to Tavor) ? > > Right. I wish we had an MTU field in the REP packet, but we dont. Yes, that would be better IMO too. Not sure why it wasn't done that way. Guess you could file an erratum on this. -- Hal References: <1139015153.475.20.camel@brick.internal.keyresearch.com> <20060215212403.GB14424@mellanox.co.il> <1140047642.673.174.camel@brick.internal.keyresearch.com> <1140062417.4333.26138.camel@hal.voltaire.com> <20060216075418.GO12974@mellanox.co.il> <1140088360.4333.29091.camel@hal.voltaire.com> Message-ID: <20060216113408.GU12974@mellanox.co.il> Quoting r. Hal Rosenstock : > > > > mst> I thought about it some more: what happens if nodes with > > > > mst> different max MTU values try to connect? > > > > > > > > This should never happen.
The SM should never give a path with an MTU > > > > that is not supported by both end nodes. > > > > > > True, but the description for the CM REJ code 26 for Invalid Path MTU > > > states something a little different (p. 667): > > > > > > "The recepient of the REQ message cannot support the maximum packet > > > payload size requested." > > > > Right. > > > > > is being interpreted as "prefers not to support the max payload size > > > requested" which is probably OK. > > > > > > > I guess the question is what to do when a Tavor (with the performance > > > > bug that makes a 1K MTU faster) connects to someone else. > > > > > > Isn't it the other way 'round (when something with a larger MTU connects > > > to Tavor) ? > > > > Right. I wish we had an MTU field in the REP packet, but we dont. > > Yes, that would be better IMO too. Not sure why it wasn't done that way. > Guess you could file an erratum on this. No idea. Possibly the REJ code 26 was deemed sufficient. We'll need some kind of solution before we start creating SDP clients with different max mtu values, controlled from software. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 16 04:53:24 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 16 Feb 2006 14:53:24 +0200 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215142529.GM24524@minantech.com> References: <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060215083115.GJ24524@minantech.com> <20060215090250.GF12974@mellanox.co.il> <20060215093007.GK24524@minantech.com> <20060215101448.GJ12974@mellanox.co.il> <20060215142529.GM24524@minantech.com> Message-ID: <20060216125324.GC12974@mellanox.co.il> Quoting Gleb Natapov : > > > > Clarification: as I see it, longer term we want to add a flag to make > > > > get_user_pages trigger an immediate page copy on fork (rather than > > > > copy_ptes). > > > > > > Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)? > > > > This should hopefully solve more than just the reg_mr issue, and not > > specific to infiniband. See e.g. here: http://lkml.org/lkml/2005/12/12/30 > > So no, this will have to be a per-page flag: set by get_user_pages when > > passed some new option, and cleared by put_page when the page ref count > > drops to page map count. > > Yes this is very serious issue I wonder why aio users don't complain all > over the lklm. (or should aio buffers have to be aligned?) No, I dont think so. > > BTW, I dont know when I will get around to working on it, so any help > > would be appreciated. > > Do you think new page flag is a viable solution? With the holy war > against new (and old) page flags. Besides fork will have to go from pte to > struct page to check flags for each mapped page in the process! We'll have to see whether this is acceptable. Clearly, if we keep this per-vma we'll need a counter, not just a flag. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Thu Feb 16 05:01:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Feb 2006 15:01:35 +0200 (IST) Subject: [openib-general] [PATCH] iser: fixes for RDMA unaligned SGs, cleanups around SG handling Message-ID: ------------------------------------------------------------------------ r5427 | ogerlitz | 2006-02-16 14:55:12 +0200 (Thu, 16 Feb 2006) | 4 lines various cleanups in the SG handling code Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5426 | ogerlitz | 2006-02-16 14:52:23 +0200 (Thu, 16 Feb 2006) | 4 lines fixes for the rare case of SGs which are unaligned for RDMA Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ Index: iser_memory.c =================================================================== --- iser_memory.c (revision 5415) +++ iser_memory.c (revision 5426) @@ -40,16 +40,7 @@ #include "iscsi_iser.h" -/** - * iser_page_to_virt - Translates page descriptor to virtual kernel address - * returns virtual kernel address - */ -inline void * -iser_page_to_virt(struct page *page) -{ - return phys_to_virt(page_to_phys(page)); -} - +#define ISER_KMALLOC_THRESHOLD 0x20000 /* 128K - kmalloc limit */ /** * Decrements the reference count for the * registered buffer & releases it @@ -141,22 +132,26 @@ void iser_start_rdma_unaligned_sg(struct struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; unsigned long cmd_data_len = iser_sg_size(p_mem); - mem = kmalloc(cmd_data_len, GFP_KERNEL | __GFP_NOFAIL); + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + mem = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOFAIL, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + mem = kmalloc(cmd_data_len, GFP_KERNEL | __GFP_NOFAIL); + if (mem == NULL) { iser_bug("Failed to allocate mem size %d %d for copying sglist\n", p_mem->size,(int)cmd_data_len); } if (cmd_dir == 
ISER_DIR_OUT) { - /* copy the sglist to p */ - /* iser_data_buf_memcpy() */ + /* copy the unaligned sg the buffer which is used for RDMA */ struct scatterlist *p_sg = (struct scatterlist *)p_mem->p_buf; int i; char *p; for (p = mem, i = 0; i < p_mem->size; i++) { memcpy(p, - iser_page_to_virt(p_sg[i].page)+ p_sg[i].offset, + page_address(p_sg[i].page) + p_sg[i].offset, p_sg[i].length); p += p_sg[i].length; } @@ -207,17 +202,21 @@ void iser_finalize_rdma_unaligned_sg(str dma_unmap_single(dma_device, dma_addr, size, DMA_FROM_DEVICE); /* copy back read RDMA to unaligned sg */ mem = p_mem_copy->p_buf; - p_sg = (struct scatterlist *)&p_iser_task->data[ISER_DIR_IN].p_buf; + p_sg = (struct scatterlist *)p_iser_task->data[ISER_DIR_IN].p_buf; sg_size = p_iser_task->data[ISER_DIR_IN].size; for (p = mem, i = 0; i < sg_size; i++){ - memcpy(iser_page_to_virt(p_sg[i].page)+p_sg[i].offset, + memcpy(page_address(p_sg[i].page) + p_sg[i].offset, p, p_sg[i].length); p += p_sg[i].length; } - kfree(p_mem_copy->p_buf); + if (size > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)p_mem_copy->p_buf, + long_log2(roundup_pow_of_two((int)size)) - PAGE_SHIFT); + else + kfree(p_mem_copy->p_buf); p_mem_copy->p_buf = NULL; } @@ -226,7 +225,11 @@ void iser_finalize_rdma_unaligned_sg(str size = p_mem_copy->size; dma_addr = p_mem_copy->dma_addr; dma_unmap_single(dma_device, dma_addr, size, DMA_TO_DEVICE); - kfree(p_mem_copy->p_buf); + if (size > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)p_mem_copy->p_buf, + long_log2(roundup_pow_of_two((int)size)) - PAGE_SHIFT); + else + kfree(p_mem_copy->p_buf); p_mem_copy->p_buf = NULL; } } Index: iscsi_iser.h =================================================================== --- iscsi_iser.h (revision 5426) +++ iscsi_iser.h (revision 5427) @@ -173,11 +173,11 @@ enum iser_buf_type { }; struct iser_data_buf { - void *p_buf; - unsigned int size; - enum iser_buf_type type; - dma_addr_t dma_addr; - unsigned int dma_nents; + enum iser_buf_type type; /* 
single or scatterlist */ + void *p_buf; /* single -> data scatterlist -> sg */ + unsigned int size; /* data len for single, nentries for sg */ + dma_addr_t dma_addr; /* returned by dma_map_single */ + unsigned int dma_nents; /* returned by dma_map_sg for */ }; /* fwd declarations */ Index: iser_memory.c =================================================================== --- iser_memory.c (revision 5426) +++ iser_memory.c (revision 5427) @@ -98,29 +98,21 @@ void iser_reg_single(struct iser_adaptor p_regd_buf->direction = direction; } -static int iser_sg_subset_len(struct iser_data_buf *p_data, - int skip_entries, - int count_entries) + +/** + * iser_sg_size - returns the total data length in an sg list + */ +int iser_sg_size(struct iser_data_buf *p_data) { struct scatterlist *p_sg = (struct scatterlist *)p_data->p_buf; - int i, last_entry, total_len = 0; + int i, total_len=0; - last_entry = skip_entries + count_entries; - for (i = skip_entries; i < last_entry; i++) + for (i = 0; i < p_data->dma_nents; i++) total_len += sg_dma_len(&p_sg[i]); return total_len; } /** - * iser_sg_size - returns the total data length in sg list - */ -int iser_sg_size(struct iser_data_buf *p_mem) -{ - return - iser_sg_subset_len(p_mem, 0, p_mem->dma_nents); -} - -/** * iser_start_rdma_unaligned_sg */ void iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task, @@ -130,7 +122,7 @@ void iser_start_rdma_unaligned_sg(struct struct device *dma_device; char *mem = NULL; struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; - unsigned long cmd_data_len = iser_sg_size(p_mem); + unsigned long cmd_data_len = p_iser_task->data_len[cmd_dir]; if (cmd_data_len > ISER_KMALLOC_THRESHOLD) mem = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOFAIL, @@ -247,8 +239,7 @@ void iser_finalize_rdma_unaligned_sg(str * consecutive elements. Also, it handles one entry SG. 
*/ static int iser_sg_to_page_vec(struct iser_data_buf *p_data, - struct iser_page_vec *page_vec, - int skip, int cnt) + struct iser_page_vec *page_vec) { struct scatterlist *p_sg = (struct scatterlist *)p_data->p_buf; dma_addr_t first_addr, last_addr, page; @@ -259,9 +250,9 @@ static int iser_sg_to_page_vec(struct is /* compute the offset of first element */ /* FIXME page_vec->offset type should be dma_addr_t */ - page_vec->offset = (u64) p_sg[skip].offset; + page_vec->offset = (u64) p_sg[0].offset; - for (i = skip; i < skip + cnt; i++) { + for (i = 0; i < p_data->dma_nents; i++) { total_sz += sg_dma_len(&p_sg[i]); first_addr = sg_dma_address(&p_sg[i]); @@ -271,7 +262,7 @@ static int iser_sg_to_page_vec(struct is end_aligned = !(last_addr & ~PAGE_MASK); /* continue to collect page fragments till aligned or SG ends */ - while (!end_aligned && (i + 1 < skip + cnt)) { + while (!end_aligned && (i + 1 < p_data->dma_nents)) { i++; total_sz += sg_dma_len(&p_sg[i]); last_addr = sg_dma_address(&p_sg[i]) + sg_dma_len(&p_sg[i]); @@ -330,19 +321,16 @@ static int iser_single_to_page_vec(struc * the number of entries which are aligned correctly. Supports the case where * consecutive SG elements are actually fragments of the same physcial page. 
*/ -static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data, - int skip) +static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data) { struct scatterlist *p_sg; dma_addr_t end_addr, next_addr; int i, cnt; unsigned int ret_len = 0; - if (p_data->type == ISER_BUF_TYPE_SINGLE) - return 1; p_sg = (struct scatterlist *)p_data->p_buf; - for (cnt = 0, i = skip; i < p_data->dma_nents; i++, cnt++) { + for (cnt = 0, i = 0; i < p_data->dma_nents; i++, cnt++) { /* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX " "offset: %ld sz: %ld\n", i, (unsigned long)page_to_phys(p_sg[i].page), @@ -393,15 +381,11 @@ static void iser_data_buf_dump(struct is * iser_page_vec_alloc - allocate page_vec covering a given data buffer */ static struct iser_page_vec *iser_page_vec_alloc(struct iser_data_buf *p_data, - int skip, int cnt) + int total_size) { struct iser_page_vec *page_vec; - int npages, total_size; + int npages; - if (p_data->type == ISER_BUF_TYPE_SINGLE) - total_size = p_data->size; - else - total_size = iser_sg_subset_len(p_data, skip, cnt); npages = total_size / PAGE_SIZE + 2; page_vec = kmalloc(sizeof(struct iser_page_vec) + @@ -431,8 +415,7 @@ static void iser_dump_page_vec(struct is } static void iser_page_vec_build(struct iser_data_buf *p_data, - struct iser_page_vec *page_vec, - int skip, int cnt) + struct iser_page_vec *page_vec) { int page_vec_len = 0; @@ -441,9 +424,9 @@ static void iser_page_vec_build(struct i page_vec_len = iser_single_to_page_vec(p_data, page_vec); } else { iser_dbg("Translating sg sz: %d\n", p_data->dma_nents); - page_vec_len = iser_sg_to_page_vec(p_data,page_vec, skip,cnt); - iser_dbg("sg size %d skip %d cnt %d page_vec_len %d\n", - p_data->dma_nents,skip,cnt,page_vec_len); + page_vec_len = iser_sg_to_page_vec(p_data,page_vec); + iser_dbg("sg len %d page_vec_len %d\n", + p_data->dma_nents,page_vec_len); } page_vec->length = page_vec_len; @@ -470,7 +453,7 @@ int iser_reg_rdma_mem(struct iscsi_iser_ struct 
iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; struct iser_page_vec *page_vec; struct iser_regd_buf *p_regd_buf; - int cnt_to_reg = 0; + int aligned_len; int err; p_regd_buf = &p_iser_task->rdma_regd[cmd_dir]; @@ -478,31 +461,23 @@ int iser_reg_rdma_mem(struct iscsi_iser_ iser_dbg("p_mem %p p_mem->type %d\n", p_mem,p_mem->type); if (p_mem->type != ISER_BUF_TYPE_SINGLE) { - int aligned_len; - - iser_dbg("converting sg to page_vec\n"); - aligned_len = iser_data_buf_aligned_len(p_mem,0); - if (aligned_len == p_mem->size) - cnt_to_reg = aligned_len; - else { - iser_err("can't reg for rdma, alignment violation\n"); + aligned_len = iser_data_buf_aligned_len(p_mem); + if (aligned_len != p_mem->size) { + iser_err("rdma alignment violation %d/%d aligned\n", + aligned_len, p_mem->size); iser_data_buf_dump(p_mem); /* allocate copy buf, if we are writing, copy the */ - /* unaligned scatterlist, anyway dma map the copy */ + /* unaligned scatterlist, dma map the copy */ iser_start_rdma_unaligned_sg(p_iser_task, cmd_dir); - p_regd_buf->virt_addr = p_iser_task->data_copy[cmd_dir].p_buf; p_mem = &p_iser_task->data_copy[cmd_dir]; } - } else { - iser_dbg("converting single to page_vec\n"); - p_regd_buf->virt_addr = p_mem->p_buf; } - page_vec = iser_page_vec_alloc(p_mem,0,cnt_to_reg); + page_vec = iser_page_vec_alloc(p_mem, p_iser_task->data_len[cmd_dir]); if(!page_vec) return -ENOMEM; - iser_page_vec_build(p_mem, page_vec,0,cnt_to_reg); + iser_page_vec_build(p_mem, page_vec); err = iser_reg_page_vec(p_iser_conn,page_vec,&p_regd_buf->reg); kfree(page_vec); if(err) From takshak at gs-lab.com Thu Feb 16 05:05:18 2006 From: takshak at gs-lab.com (Takshak C.) Date: Thu, 16 Feb 2006 18:35:18 +0530 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> Message-ID: <43F4788E.3070909@gs-lab.com> Hi Hal, Thanks for the information. 
Based on your feedback, I have corrected the SA MAD structure and added the
RMPP header as below:

typedef struct mad_header
{
    uint8_t  base_version;
    uint8_t  mgmt_class;
    uint8_t  class_version;
    uint8_t  response_bit;
    uint8_t  method;
    uint16_t status;
    uint16_t class_spec;
    uint64_t tid;
    uint16_t attr_id;
    uint16_t resv;
    uint32_t attr_mod;
} ib_mad_t ;

typedef struct _ib_rmpp_mad
{
    ib_mad_t common_hdr;
    uint8_t  rmpp_version;
    uint8_t  rmpp_type;
    uint8_t  rmpp_flags;
    uint8_t  rmpp_status;
    uint32_t seg_num;
    uint32_t paylen_newwin;
    uint8_t  data[192];
} ib_rmpp_mad_t;

ib_rmpp_mad_t *p_mad = (struct ib_rmpp_mad_t*)(umad_get_mad(umad));
memset(p_mad, 0, sizeof(ib_rmpp_mad_t));

p_mad->common_hdr.base_version = 1 ;
p_mad->common_hdr.mgmt_class = IB_MCLASS_SUBN_ADM ;
p_mad->common_hdr.class_version = (uint8_t) 2 ;
p_mad->common_hdr.method = IB_MAD_METHOD_GET;
p_mad->common_hdr.status = 0;
p_mad->common_hdr.class_spec = 0;
p_mad->common_hdr.tid = 0x123 ;
p_mad->common_hdr.attr_id = IB_SA_ATTR_PATHRECORD ;
p_mad->common_hdr.resv = 0;
p_mad->common_hdr.attr_mod = 0 ;

p_mad->rmpp_version = 1 ;
p_mad->seg_num = 1 ;
p_mad->rmpp_flags |= (uint8_t)0x70 ;
p_mad->rmpp_status = 0;
p_mad->paylen_newwin = IB_MAD_SIZE - MAD_RMPP_HDR_SIZE ; // 256 - 36

umad_set_addr(umad, port_attr.sm_lid , 1, 0, IB_DEFAULT_QP1_QKEY);
umad_set_grh(umad, 0);
umad_set_pkey(umad, 0xFFFF); // IB_DEFAULT_PKEY 0xFFFF

I have registered umad with RMPP = 1.
I have not started an OpenSM instance here. I'm using a vendor-specific SM.

I have done umad_send(...) and umad_recv().

Could you please tell me what I should do to retrieve the PathRecord after
the umad_recv(...) call?

I believe that if I call:

ib_rmpp_mad_t *recv_mad = (ib_rmpp_mad_t*)umad_get_mad(umad);

and then read this recv_mad, it will give me the path record from the local HCA?

Or is there any other good way to do the same? Please throw some light on this.
Do you have any userspace SA support for retrieving path and service record
information?

Regards.
- Takshak Hal Rosenstock wrote: >Hi, > >There are a couple of issues with the below. > >1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. > >2. I will assume your register call sets RMPP. > >3. SA class version is 2. > >What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). > >-- Hal > >________________________________ > >From: openib-general-bounces at openib.org on behalf of Takshak C. >Sent: Mon 2/6/2006 8:00 AM >To: openib-general at openib.org >Subject: [openib-general] Get Table Records for SA Attribute ID ? > > > >Hi, > >I m trying to get the table records for SA attribute ID in following way. >But, I m not getting a single record, could anyone comment on the problem. > >1. I have created saMadFormat structure described in the specification as below: > >struct saMadFormat >{ > > uint8_t base_version ; > uint8_t mgmt_class ; > uint8_t class_version ; > uint8_t sa_method ; > uint16_t status ; > uint16_t not_used ; > uint64_t tid ; > uint16_t attr_id ; > uint16_t resv ; > uint32_t attr_mod ; > uint64_t sa_key; > uint64_t sm_key ; > uint32_t seg_num ; > uint32_t payload_len ; > uint8_t frag_flag ; > uint8_t edit_mod ; > uint16_t window ; > uint32_t endRID ; > uint64_t comp_mask ; > uint8_t adminData[192] ; >}; > >2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS > and umad_open_port etc successfully. > >3. 
struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); > memset(saQuery, 0, sizeof(*saQuery)); > > saQuery->base_version = 1; > saQuery->mgmt_class = IB_SA_CLASS ; > saQuery->class_version = 1 ; > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; > saQuery->attr_mod = 0 ; > saQuery->tid = htonll(drmad_tid++); > saQuery->endRID = 0 ; > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); > umad_set_grh(umad, 0); > umad_set_pkey(umad, 0xFFFF); > >4. length = IB_MAD_SIZE; > > if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) > IBPANIC("send failed"); > > if (umad_recv(portid, umad, &length, -1) != mad_agent) > IBPANIC("recv error: %s", drmad_status_str(saQuery)); > > > > if (!dump_char) { > xdump(stdout, 0, saQuery->adminData, 192); > return 0; > } > >I m expecting that, I will get the resultant data in saQuery->adminData. >Is this correct ? If not then, how should I retrieve the table records ? >Any Idea ? > > >Thanks >- Takshak > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > From rpearson at systemfabricworks.com Thu Feb 16 06:44:48 2006 From: rpearson at systemfabricworks.com (Bob Pearson) Date: Thu, 16 Feb 2006 08:44:48 -0600 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <19a929370602152056g40b604c3ha846b705c2bd7941@mail.gmail.com> Message-ID: <000f01c63307$89a09cf0$9f01a8c0@BOBP> Jim has worked with the A8N-SLI board. We have worked with a newer version called A8N32-SLI Deluxe which has much better PCIe support. Ours has only run the gen1 stack so I have no idea what it does with the gen2 stack. 
As for the BIOS issue mentioned, perhaps Steve W could take a look and see if
the comment from Jason makes any sense, but my impression is that everything
was just fine.

  _____

From: Bill Boas [mailto:bill.boas at gmail.com]
Sent: Wednesday, February 15, 2006 10:56 PM
To: Bob Pearson
Subject: Fwd: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards

Do we have something to contribute to this thread?

---------- Forwarded message ----------
From: Roland Dreier
Date: Feb 15, 2006 4:52 PM
Subject: Re: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards
To: Jason Gunthorpe
Cc: openib-general at openib.org

    Jason> Unfortunately the board's BIOS's ACPI tables are totally
    Jason> broken and the BIOS assigns every interrupt source in the
    Jason> system to IRQ 5 :< At least it does work in full APIC mode
    Jason> which means it might be possible to get MSI working if the
    Jason> nvidia bridge isn't broken.

I've not seen any problems with MSI/MSI-X with nforce4 and PCIe HCAs
on both my Asus A8N-SLI and HP DL145G2 systems.

Just building a kernel with CONFIG_PCI_MSI enabled might help, since
it changes the way the kernel numbers interrupts.

 - R.

From swelch at systemfabricworks.com Thu Feb 16 07:21:24 2006
From: swelch at systemfabricworks.com (Steve Welch)
Date: Thu, 16 Feb 2006 09:21:24 -0600
Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards
In-Reply-To: <000f01c63307$89a09cf0$9f01a8c0@BOBP>
Message-ID: <064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com>

Bob,

We do not have interrupt assignment problems for the HCA that I'm aware of
using the A8N32-SLI Deluxe MB.
We do have message signaled interrupts enabled as well (Roland's
suggestion to Jason). Output follows.

Steve W.

Feb 16 07:50:26 localhost kernel: ACPI: PCI interrupt 0000:07:00.0[A] -> GSI 17 (level, low) -> IRQ 225

[root at localhost ~]# cat /proc/interrupts
           CPU0
  0:   55263251    IO-APIC-edge  timer
  1:         16    IO-APIC-edge  i8042
  8:          0    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:        402    IO-APIC-edge  i8042
177:      10162   IO-APIC-level  libata
185:          0   IO-APIC-level  libata, ohci_hcd
193:    5559144   IO-APIC-level  ehci_hcd, eth0
201:          0   IO-APIC-level  NVidia CK804
209:   18286320   IO-APIC-level  xxxxxx
217:   19671760   IO-APIC-level  xxxxxx
225:   33752396   IO-APIC-level  InfiniHost_III_Lx0
NMI:       2101
LOC:   55255250
ERR:          0
MIS:          0

________________________________________
From: Bob Pearson [mailto:rpearson at systemfabricworks.com]
Sent: Thursday, February 16, 2006 8:45 AM
To: 'Bill Boas'; openib-general at openib.org
Cc: 'Steve Welch'; 'Jim Mott'
Subject: RE: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards

Jim has worked with the A8N-SLI board. We have worked with a newer version
called A8N32-SLI Deluxe which has much better PCIe support. Ours has only run
the gen1 stack so I have no idea what it does with the gen2 stack.

As for the BIOS issue mentioned perhaps Steve W could take a look and see if
the comment from Jason makes any sense but my impression is that everything
was just fine.

________________________________________
From: Bill Boas [mailto:bill.boas at gmail.com]
Sent: Wednesday, February 15, 2006 10:56 PM
To: Bob Pearson
Subject: Fwd: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards

Do we have something to contribute to this thread?

---------- Forwarded message ----------
From: Roland Dreier
Date: Feb 15, 2006 4:52 PM
Subject: Re: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards
To: Jason Gunthorpe
Cc: openib-general at openib.org

    Jason> Unfortunately the board's BIOS's ACPI tables are totally
    Jason> broken and the BIOS assigns every interrupt source in the
    Jason> system to IRQ 5 :< At least it does work in full APIC mode
    Jason> which means it might be possible to get MSI working if the
    Jason> nvidia bridge isn't broken.

I've not seen any problems with MSI/MSI-X with nforce4 and PCIe HCAs
on both my Asus A8N-SLI and HP DL145G2 systems.

Just building a kernel with CONFIG_PCI_MSI enabled might help, since
it changes the way the kernel numbers interrupts.

 - R.

From rdreier at cisco.com Thu Feb 16 09:28:22 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 09:28:22 -0800
Subject: [openib-general] Re: PATCH] mthca - command interface - revised
In-Reply-To: <1140080911.4601.52.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Thu, 16 Feb 2006 11:08:31 +0200")
References: <1140080911.4601.52.camel@mtls03.yok.mtl.com>
Message-ID:

    Eli> Looks like we all agree that should be better to post
    Eli> commands through doorbells and save the polling on the go
    Eli> bit. Then I can change the patch so that there will be a r/w
    Eli> module parameter to mthca, ON by default, to enable this
    Eli> feature and the user can turn off/on this functionality on
    Eli> the fly by changing the value of this parameter. If we agree
    Eli> on this I'll post a new patch early next week

That makes sense to me.

What workloads benefit from this? MST mentioned that some systems
and/or workloads get slower -- do you have any details here?

 - R.
From iod00d at hp.com Thu Feb 16 09:47:00 2006
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 16 Feb 2006 09:47:00 -0800
Subject: [openib-general] Re: PATCH] mthca - command interface - revised
In-Reply-To: <1140080911.4601.52.camel@mtls03.yok.mtl.com>
References: <1140080911.4601.52.camel@mtls03.yok.mtl.com>
Message-ID: <20060216174700.GB19310@esmail.cup.hp.com>

On Thu, Feb 16, 2006 at 11:08:31AM +0200, Eli Cohen wrote:
...
> Then I can change the patch
> so that there will be a r/w module parameter to mthca, ON by default, to
> enable this feature and the user can turn off/on this functionality on
> the fly by changing the value of this parameter.

Please also include a short bit of documentation - just a paragraph or
two - on why a user would want to twiddle this parameter.
Users will ask openib.org mailing lists and eventually the distros.

thanks,
grant

From rdreier at cisco.com Thu Feb 16 11:05:54 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 11:05:54 -0800
Subject: [openib-general] [ANNOUNCE] libibverbs 1.0-rc7 released
Message-ID:

I just tagged a 1.0-rc7 release of libibverbs and pushed it out to the
relevant channels, which means that it should appear on
http://openib.org/downloads/ shortly. Binary packages will be in
Fedora Extras when the builds complete.

Please download and test this release, so that the final 1.0 release
can be a good one.

The API and ABI are now frozen for the 1.0 release, which is currently
planned for roughly three weeks from now. Any incompatible changes
will have to wait for the 1.1 release. Bug fixes and other changes
that don't affect the library interface are of course welcome any
time.

Changes since 1.0-rc5 include:

 - Handle new kernel ABI (version 5) required for proper alignment of
   the create QP response structure.
 - Update low-level driver library interface so that kernel drivers can
   return device-specific information to the create QP operation.
 - Add query QP and query SRQ support (contributed by Mellanox).
 - Add resize CQ support.
 - Bug fixes and cleanups as usual.

See the ChangeLog in the package for full details.

Thanks,
  Roland

From xma at us.ibm.com Thu Feb 16 11:14:48 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 16 Feb 2006 11:14:48 -0800
Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To: <20060202185713.GA6219@mellanox.co.il>
Message-ID:

Hi, Michael,

What's the current status of this patch?

I kept hitting the panic when bringing the interface up and down. I went
through the neighbour and ipoib_neigh code. I think it's not necessary to
patch net/core/neighbour.

The reason we hit this problem is that the neighbour won't have a pointer to
ipoib_neigh if path_free() or mcast_free() has been called by the time the
neighbour is freed. (ipoib_neigh always has a pointer to a neighbour.)

If neigh_destructor() gets called in this context, ipoib_neigh_destructor()
does nothing when ipoib_neigh is NULL. So removing
neigh->neighbour->ops->destructor = NULL before kfree(neigh) is sufficient to
fix this problem. What do you think?

Here is the patch I used for testing.

diff -urN infiniband/ulp/ipoib/ipoib_main.c infiniband-patch/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-02-01 13:45:43.000000000 -0800
+++ infiniband-patch/ulp/ipoib/ipoib_main.c	2006-02-16 11:02:24.902458152 -0800
@@ -247,7 +247,6 @@
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
 		*to_ipoib_neigh(neigh->neighbour) = NULL;
-		neigh->neighbour->ops->destructor = NULL;
 		kfree(neigh);
 	}
 
@@ -530,7 +529,6 @@
 err:
 	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
 	list_del(&neigh->list);
-	neigh->neighbour->ops->destructor = NULL;
 	kfree(neigh);
 
 	++priv->stats.tx_dropped;
diff -urN infiniband/ulp/ipoib/ipoib_multicast.c infiniband-patch/ulp/ipoib/ipoib_multicast.c
--- infiniband/ulp/ipoib/ipoib_multicast.c	2006-02-16 11:00:40.379348080 -0800
+++ infiniband-patch/ulp/ipoib/ipoib_multicast.c	2006-02-16 11:02:33.225192904 -0800
@@ -115,7 +115,6 @@
 		if (neigh->ah)
 			ipoib_put_ah(neigh->ah);
 		*to_ipoib_neigh(neigh->neighbour) = NULL;
-		neigh->neighbour->ops->destructor = NULL;
 		kfree(neigh);
 	}
 

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From rdreier at cisco.com Thu Feb 16 11:17:41 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 11:17:41 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: (Roland Dreier's message of "Thu, 16 Feb 2006 11:05:54 -0800")
References:
Message-ID:

I thought it might be helpful to give an informal roadmap of my plans
for libibverbs. As I said, I hope to have a libibverbs 1.0 release
out in three weeks or so. Once that happens, I plan to do 1.0.x
maintenance releases "as needed."

At the same time, I have some ideas for libibverbs 1.1. I think that
I can get my ideas done in six months or so, which seems like a good
interval between major releases.

My ideas (also listed in the libibverbs README) are the following;
I'd like to hear what other people's plans for libibverbs work are so
that we can figure out if those projects fit into a 1.1 release, and
when we expect to land the new features.

 * Implement memory window (MW) support. This will break the device
   driver ABI, because new methods will need to be added to struct
   ibv_context_ops.

 * Implement the reregister memory region (MR) verb. We will add an
   extension to the IB spec to allow the application to indicate that
   the region is only being extended, and that operations in progress
   should _not_ fail (contrary to the IB spec, which states that
   reregister must be implemented so that it behaves equivalently to a
   deregister followed by a register). This will break the device
   driver ABI, because a new method will need to be added to struct
   ibv_context_ops.

 * Eliminate the dependency on libsysfs by implementing the required
   sysfs handling directly. This will break the API, because the dev
   and ibdev members of struct ibv_device will be removed. It will
   also break the device driver ABI, because the signature of the
   driver initialization function will change. The driver
   initialization function will be changed as part of this work; this
   has the added benefit of allowing us to choose a better name than
   "openib_driver_init."

I'm also thinking of moving my libibverbs and libmthca development
trees to git (most likely hosted at kernel.org). This has the
drawback of moving their development repositories out of the common
openib.org svn tree. However, it will make handling 1.0, 1.1 and
feature development branches much easier. I'd like to hear opinions
on this before I make a decision.

 - R.

From mst at mellanox.co.il Thu Feb 16 11:23:03 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 16 Feb 2006 21:23:03 +0200
Subject: [openib-general] Re: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
In-Reply-To:
References: <20060202185713.GA6219@mellanox.co.il>
Message-ID: <20060216192303.GA20997@mellanox.co.il>

Quoting r. Shirley Ma :
> Subject: Re: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params
>
> Hi, Michael,
>
> What's the current status of this patch?
>
> I kept hitting the panic when bringing the interface up and down. I went
> through the neighbour and ipoib_neigh code. I think it's not necessary to
> patch net/core/neighbour.
>
> The reason we hit this problem is that the neighbour won't have a pointer to
> ipoib_neigh if path_free() or mcast_free() has been called by the time the
> neighbour is freed. (ipoib_neigh always has a pointer to a neighbour.)
>
> If neigh_destructor() gets called in this context, ipoib_neigh_destructor()
> does nothing when ipoib_neigh is NULL. So removing
> neigh->neighbour->ops->destructor = NULL before kfree(neigh) is sufficient to
> fix this problem. What do you think?
>
> Here is the patch I used for testing.

With this approach you'll get crashes when the module gets unloaded.
Further, we may get called with a neighbour that is not related to ipoib at all.

Shirley, please look under
https://openib.org/svn/trunk/contrib/mellanox/patches

There is a set of various patches for ipoib pending Roland's review there.
Most of them fix various hang/oops conditions or packet leaks.
An approach to fixing this specific problem that does not involve kernel patches
is implemented in patch ipoib_all_neigh_issues_2.patch.

I'll be thankful for more testing.

-- 
Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From jgunthorpe at obsidianresearch.com Thu Feb 16 11:21:38 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 16 Feb 2006 12:21:38 -0700 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: References: <20060216004543.GA2135@obsidianresearch.com> Message-ID: <20060216192138.GB3901@obsidianresearch.com> On Wed, Feb 15, 2006 at 04:52:17PM -0800, Roland Dreier wrote: > I've not seen any problems with MSI/MSI-X with nforce4 and PCIe HCAs > on both my Asus A8N-SLI and HP DL145G2 systems. I've managed to make it work here too. I found an earlier email from you about the HT MSI Mapping capability which proved to be the key. The BIOS on this board doesn't enable it :| People on this list may be interested in this patch which adds a PCI quirk to force the HyperTransport MSI mapping function to be enabled if the BIOS forgot about it. It makes my board work: ib1{jgg}~#cat /proc/interrupts CPU0 0: 344750 IO-APIC-edge timer 1: 8 IO-APIC-edge i8042 2: 0 XT-PIC cascade 5: 141156 IO-APIC-level ohci_hcd:usb1, ehci_hcd:usb2, eth0 8: 4 IO-APIC-edge rtc 14: 1445 IO-APIC-edge ide0 15: 15 IO-APIC-edge ide1 193: 179 PCI-MSI ib_mthca Thanks, Jason --- linux-2.6.15.4/drivers/pci/quirks.c 2006-02-16 12:08:59.000000000 -0700 +++ lin/drivers/pci/quirks.c 2006-02-16 12:12:30.000000000 -0700 @@ -1257,6 +1257,29 @@ } DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NCR, PCI_DEVICE_ID_NCR_53C810, fixup_rev1_53c810); +#ifdef CONFIG_PCI_MSI +static void __devinit fixup_ht_msi(struct pci_dev* dev) +{ + /* Some BIOS's do not enable the hypertransport MSI mapping capability + on the chipset. This breaks MSI support.. 
*/ + int pos = pci_find_capability(dev,PCI_CAP_ID_HT); + while (pos != 0) + { + u32 cap; + pci_read_config_dword(dev,pos,&cap); + if (((cap >> 16) & PCI_HT_CMD_TYP) == PCI_HT_CMD_TYP_MSIM) { + if ((cap & PCI_HT_MSIM_ENABLE) == 0) { + printk("BIOS BUG: HyperTransport MSI mapping not enabled for %s, enabling.\n",pci_name(dev)); + cap |= PCI_HT_MSIM_ENABLE; + pci_write_config_dword(dev,pos,cap); + } + break; + } + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); + } +} +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, fixup_ht_msi); +#endif static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f, struct pci_fixup *end) { --- linux-2.6.15.4/include/linux/pci_regs.h 2006-02-16 12:09:05.000000000 -0700 +++ lin/include/linux/pci_regs.h 2006-02-16 12:12:30.000000000 -0700 @@ -196,12 +196,14 @@ #define PCI_CAP_ID_MSI 0x05 /* Message Signalled Interrupts */ #define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */ #define PCI_CAP_ID_PCIX 0x07 /* PCI-X */ +#define PCI_CAP_ID_HT 0x08 /* HyperTransport */ #define PCI_CAP_ID_SHPC 0x0C /* PCI Standard Hot-Plug Controller */ #define PCI_CAP_ID_EXP 0x10 /* PCI Express */ #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ #define PCI_CAP_SIZEOF 4 +#define PCI_HT_CMD_TYP 0xf800 /* Hypertransport capability type mask */ /* Power Management Registers */ @@ -285,6 +287,10 @@ #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */ #define PCI_MSI_MASK_BIT 16 /* Mask bits register */ +/* HyperTransport MSI Mapping registers */ +#define PCI_HT_CMD_TYP_MSIM 0xa800 // MSI Mapping type +#define PCI_HT_MSIM_ENABLE (1<<16) + /* CompactPCI Hotswap Register */ #define PCI_CHSWP_CSR 2 /* Control and Status Register */ From ftillier at silverstorm.com Thu Feb 16 11:24:51 2006 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 16 Feb 2006 11:24:51 -0800 Subject: [openib-general] Plans for libibverbs 
1.0, 1.1 and beyond In-Reply-To: Message-ID: <001401c6332e$ae5499d0$6701a8c0@infiniconsys.com> > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, February 16, 2006 11:18 AM > > I'm also thinking of moving my libibverbs and libmthca development > trees to git (most likely hosted at kernel.org). This has the > drawback of moving their development repositories out of the common > openib.org svn tree. However, it will make handling 1.0, 1.1 and > feature development branches much easier. I'd like to hear opinions > on this before I make a decision. I don't think the host of the repository matters. I don't know anything about git and its potential impact on the OpenIB dual license. As long as development is still done under the dual license I don't see any problem using a different SCM tool. Would we be able to host the git tree on the same server as SVN? - Fab From rdreier at cisco.com Thu Feb 16 11:28:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Feb 2006 11:28:57 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <001401c6332e$ae5499d0$6701a8c0@infiniconsys.com> (Fab Tillier's message of "Thu, 16 Feb 2006 11:24:51 -0800") References: <001401c6332e$ae5499d0$6701a8c0@infiniconsys.com> Message-ID: Fab> I don't think the host of the repository matters. I don't Fab> know anything about git and its potential impact on the Fab> OpenIB dual license. As long as development is still done Fab> under the dual license I don't see any problem using a Fab> different SCM tool. Would we be able to host the git tree on Fab> the same server as SVN? I don't think the SCM tool has anything to do with the license of the code. Using git to hold a tree is completely orthogonal to the license of the code in that tree. git trees could be hosted on openib.org but I'm not sure what the advantage is. It would be more of a pain for the openib.org admins, since it's yet another server to maintain. - R. 
From jgunthorpe at obsidianresearch.com Thu Feb 16 11:38:43 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 16 Feb 2006 12:38:43 -0700 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com> References: <000f01c63307$89a09cf0$9f01a8c0@BOBP> <064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com> Message-ID: <20060216193843.GC3901@obsidianresearch.com> On Thu, Feb 16, 2006 at 09:21:24AM -0600, Steve Welch wrote: > We do not have interrupt assignment problems for the HCA that I'm aware of > using the A8N32-SLI Deluxe MB. We do have message signaled interrupts > enabled as well (Roland's suggestion to Jason). Output follows. Thanks, thats good to know. Roland mentioned that his A8N-SLI only got a 1x PCI Express link, do you have the same problem? A recent lspci -vvv will show the link width.. My MSI problem was a BIOS bug, it did not enable support in the chipset. I have it working now. So to sumarize I've got this info: MSI MSI-7184 - Will not POST Asus A8N-SLI - Works, but does (may?) not get a x4/x8 link Asus A8N32-SLI Deluxe - Works Asus A8N-VM - Works, but needs a kernel patch and ACPI disabled. Tyan * - Mellanox Tyan FAE says recent ones work Supermicro * - Mellanox FAE says recent ones work Regards, Jason From halr at voltaire.com Thu Feb 16 11:29:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Feb 2006 14:29:51 -0500 Subject: [openib-general] [PATCH 1 of 3] mad: large RMPP support, Round 2 In-Reply-To: <20060212152744.GA19049@mellanox.co.il> References: <20060212152744.GA19049@mellanox.co.il> Message-ID: <1140118189.4333.32784.camel@hal.voltaire.com> On Sun, 2006-02-12 at 10:27, Jack Morgenstein wrote: > Implement large RMPP support: I applied this patch series as it addresses Sean's primary objection in terms of walking the RMPP list (being O(n) rather than O(n**2)) and addressing most of his other comments. 
It certainly makes things better than they are now and Sean can make further changes next week. Sean also had a comment about single segment MADs/single SGE relative to mad.c::ib_create_send_mad which I don't see addressed but he did say he would be willing to accept it without this. It is important to start to get feedback on this from the larger community (especially those with large clusters) prior to pushing this upstream. Also, user_mad.c::copy_recv_mad appears to still be copying to a temporary buffer but I think that was what coalesce was already doing. A nit is to factor out the common data_offset code in mad_rmpp.c and user_mad.c. -- Hal From mst at mellanox.co.il Thu Feb 16 11:48:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Feb 2006 21:48:22 +0200 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: Message-ID: <20060216194821.GB20997@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Plans for libibverbs 1.0, 1.1 and beyond > > I thought it might be helpful to give an informal roadmap of my plans > for libibverbs. As I said, I hope to have a libibverbs 1.0 release > out in three weeks or so. Once that happens, I plan to do 1.0.x > maintenance releases "as needed." One thing that might be important is to use the upcoming madvise(MADV_DONTFORK) on private CQ/QP work request ring buffers. This would require allocating these from page-sized pools. > I'm also thinking of moving my libibverbs and libmthca development > trees to git (most likely hosted at kernel.org). This has the > drawback of moving their development repositories out of the common > openib.org svn tree. However, it will make handling 1.0, 1.1 and > feature development branches much easier. I'd like to hear opinions > on this before I make a decision. Well, since you ask, I am pretty happy with how things work with svn. Do you expect enough development on branches to need the advanced merging that git provides? 
I would think we just need an occasional bugfix there, which should be easy to handle by plain patches. Hopefully we'll just have a development trunk and stable branches, not development branches. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 16 11:52:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Feb 2006 21:52:19 +0200 Subject: [openib-general] Re: Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <20060216193843.GC3901@obsidianresearch.com> References: <000f01c63307$89a09cf0$9f01a8c0@BOBP> <064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com> <20060216193843.GC3901@obsidianresearch.com> Message-ID: <20060216195219.GC20997@mellanox.co.il> Quoting r. Jason Gunthorpe : > Subject: Re: Low cost MBs compatible with Mellanox PCIE Cards > > On Thu, Feb 16, 2006 at 09:21:24AM -0600, Steve Welch wrote: > > > We do not have interrupt assignment problems for the HCA that I'm aware of > > using the A8N32-SLI Deluxe MB. We do have message signaled interrupts > > enabled as well (Roland's suggestion to Jason). Output follows. > > Thanks, thats good to know. Roland mentioned that his A8N-SLI only > got a 1x PCI Express link, do you have the same problem? A recent > lspci -vvv will show the link width.. > > My MSI problem was a BIOS bug, it did not enable support in the > chipset. I have it working now. > > So to sumarize I've got this info: > MSI MSI-7184 - Will not POST > Asus A8N-SLI - Works, but does (may?) not get a x4/x8 link > Asus A8N32-SLI Deluxe - Works > Asus A8N-VM - Works, but needs a kernel patch and ACPI disabled. > Tyan * - Mellanox Tyan FAE says recent ones work > Supermicro * - Mellanox FAE says recent ones work > > Regards, > Jason It could make sense to put this in openib Wiki together with any info on patches needed to make it work. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From gdror at mellanox.co.il Thu Feb 16 11:53:21 2006 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 16 Feb 2006 21:53:21 +0200 Subject: [openib-general] Re: Re: [PATCH] change Mellanox SDP workaround to a moduleparameter Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Thursday, February 16, 2006 1:13 PM > > On Thu, 2006-02-16 at 02:54, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock : > > > Subject: Re: Re: [PATCH] change Mellanox SDP workaround to a > > > moduleparameter > > > > > > On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > > > > > > > I guess the question is what to do when a Tavor (with the > > > > performance bug that makes a 1K MTU faster) connects to someone > > > > else. > > > > > > Isn't it the other way 'round (when something with a larger MTU > > > connects to Tavor) ? > > > > Right. I wish we had an MTU field in the REP packet, but we don't. > > Yes, that would be better IMO too. Not sure why it wasn't > done that way. Guess you could file an erratum on this. > > -- Hal The SWG defined a generic mechanism which uses REJ to indicate that the passive side does not accept certain REQ fields, and allows the passive side to indicate an alternative value. Indirection is also supported through the same protocol. It also allows the active side, following the REJ, to use an alternate value, other than the one suggested by the passive side, i.e. the passive side has only a veto capability. This is the mechanism and the short theory behind it. Unfortunately, it's a bit inefficient in terms of performance because of the ping pong of messages. Solving just the MTU might not be a good enough argument. 
The approach should be to enable the active side to specify a set of acceptable parameters for each one of the REQ fields, and then let the passive side choose. This may change the CM packets all over and will introduce new problems. I don't think that there's a good chance of just adding a solution for just one of the fields. Anyway, you can still try and propose this to IBTA, I tried it once already :) From xma at us.ibm.com Thu Feb 16 11:53:00 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 16 Feb 2006 11:53:00 -0800 Subject: [openib-general] Re: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: <20060216192303.GA20997@mellanox.co.il> Message-ID: Michael, > With this approach you'll get crashes when the module will get unloaded. I haven't hit the crash yet when unloading the module. Could you explain it in detail? Do you mean it might hit the problem below? unregister_netdevice: waiting for ib0 to become free. Usage count = 3 > Further, we may get called with neighbour that is not related to ipoib at all. A non-ipoib neighbour shouldn't have an ipoib destructor allocated. Thanks for the pointer, I will look at these patches soon. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From halr at voltaire.com Thu Feb 16 11:50:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Feb 2006 14:50:20 -0500 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060216194821.GB20997@mellanox.co.il> References: <20060216194821.GB20997@mellanox.co.il> Message-ID: <1140119260.4333.32933.camel@hal.voltaire.com> On Thu, 2006-02-16 at 14:48, Michael S. Tsirkin wrote: > Quoting r. 
Roland Dreier : > > Subject: Plans for libibverbs 1.0, 1.1 and beyond > > > > I thought it might be helpful to give an informal roadmap of my plans > > for libibverbs. As I said, I hope to have a libibverbs 1.0 release > > out in three weeks or so. Once that happens, I plan to do 1.0.x > > maintenance releases "as needed." > > One thing that might be important is to use the upcoming > madvise(MADV_DONTFORK) on private CQ/QP work request ring buffers. > This would require allocating these from page-sized pools. Could this be done as fix release to 1.0 ? -- Hal From swelch at systemfabricworks.com Thu Feb 16 12:13:47 2006 From: swelch at systemfabricworks.com (Steve Welch) Date: Thu, 16 Feb 2006 14:13:47 -0600 Subject: [openib-general] Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <20060216193843.GC3901@obsidianresearch.com> Message-ID: The HCA shows x8 on the A8N32-SLI Deluxe: Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8 Link: Latency L0s unlimited, L1 unlimited Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch- Link: Speed 2.5Gb/s, Width x8 Glad to hear you have the MSI problem taken care of. Steve > -----Original Message----- > From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com] > Sent: Thursday, February 16, 2006 1:39 PM > To: Steve Welch > Cc: 'Bob Pearson'; 'Bill Boas'; openib-general at openib.org > Subject: Re: [openib-general] Low cost MBs compatible with Mellanox PCIE > Cards > > On Thu, Feb 16, 2006 at 09:21:24AM -0600, Steve Welch wrote: > > > We do not have interrupt assignment problems for the HCA that I'm aware > of > > using the A8N32-SLI Deluxe MB. We do have message signaled interrupts > > enabled as well (Roland's suggestion to Jason). Output follows. > > Thanks, thats good to know. Roland mentioned that his A8N-SLI only > got a 1x PCI Express link, do you have the same problem? A recent > lspci -vvv will show the link width.. > > My MSI problem was a BIOS bug, it did not enable support in the > chipset. 
I have it working now. > > So to sumarize I've got this info: > MSI MSI-7184 - Will not POST > Asus A8N-SLI - Works, but does (may?) not get a x4/x8 link > Asus A8N32-SLI Deluxe - Works > Asus A8N-VM - Works, but needs a kernel patch and ACPI > disabled. > Tyan * - Mellanox Tyan FAE says recent ones work > Supermicro * - Mellanox FAE says recent ones work > > Regards, > Jason From jlentini at netapp.com Thu Feb 16 12:21:00 2006 From: jlentini at netapp.com (James Lentini) Date: Thu, 16 Feb 2006 15:21:00 -0500 (EST) Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: Message-ID: roland> * Implement memory window (MW) support. cool roland> I'm also thinking of moving my libibverbs and libmthca roland> development trees to git (most likely hosted at kernel.org). roland> This has the drawback of moving their development repositories roland> out of the common openib.org svn tree. However, it will make roland> handling 1.0, 1.1 and feature development branches much roland> easier. I wouldn't want to see the repository split up. Is moving all of the OpenIB code from svn to git an option? From halr at voltaire.com Thu Feb 16 12:12:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Feb 2006 15:12:28 -0500 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <43F4788E.3070909@gs-lab.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> Message-ID: <1140120738.4333.33149.camel@hal.voltaire.com> Hi Takshak, On Thu, 2006-02-16 at 08:05, Takshak C. wrote: > Hi Hal, > > Thanks for the information. > Based on your feedback, I have corrected the SA MAD structure and added RMPP header as below: > > typedef struct mad_header > { > uint8_t base_version; > uint8_t mgmt_class; > uint8_t class_version; > uint8_t response_bit; > uint8_t method; Response bit is combined with method in 8 bits. 
> uint16_t status; > uint16_t class_spec; > uint64_t tid; > uint16_t attr_id; > uint16_t resv; > uint32_t attr_mod; > } ib_mad_t ; > > typedef struct _ib_rmpp_mad > { > ib_mad_t common_hdr; > > uint8_t rmpp_version; > uint8_t rmpp_type; > uint8_t rmpp_flags; > uint8_t rmpp_status; > > uint32_t seg_num; > uint32_t paylen_newwin; > uint8_t data[192]; > > } ib_rmpp_mad_t; > > ib_rmpp_mad_t *p_mad = (struct ib_rmpp_mad_t*)(umad_get_mad(umad)); > memset(p_mad, 0, sizeof(ib_rmpp_mad_t)); > > p_mad->common_hdr.base_version = 1 ; > p_mad->common_hdr.mgmt_class = IB_MCLASS_SUBN_ADM ; > p_mad->common_hdr.class_version = (uint8_t) 2 ; > p_mad->common_hdr.method = IB_MAD_METHOD_GET; > p_mad->common_hdr.status = 0; > p_mad->common_hdr.class_spec = 0; > p_mad->common_hdr.tid = 0x123 ; > p_mad->common_hdr.attr_id = IB_SA_ATTR_PATHRECORD ; > p_mad->common_hdr.resv = 0; > p_mad->common_hdr.attr_mod = 0 ; > > p_mad->rmpp_version = 1 ; > p_mad->seg_num = 1 ; > p_mad->rmpp_flags |= (uint8_t)0x70 ; > p_mad->rmpp_status = 0; > p_mad->paylen_newwin = IB_MAD_SIZE - MAD_RMPP_HDR_SIZE ; // 256 - 36 Since you are doing a GET rather than GET TABLE, none of this RMPP stuff matters. > umad_set_addr(umad, port_attr.sm_lid , 1, 0, IB_DEFAULT_QP1_QKEY); > umad_set_grh(umad, 0); > umad_set_pkey(umad, 0xFFFF); // IB_DEFAULT_PKEY 0xFFFF > > > I have registered umad with RMPP = 1. I have not started openSM instance here. I m using vendor specific SM. > I have done umad_send(...) and umad_recv(). Do you see the packet sent on the IB wire ? Do you see a response come in ? If you don't have an analyzer, you can use madeye. > Could you please tell me, what should I do to retrieve the PathRecord after umad_recv(...) call ? You would need to decode it. > I believe, if I call function : > ib_rmpp_mad_t *recv_mad = (ib_rmpp_mad_t*)umad_get_mad(umad); and then > read this recv_mad will give me path record from local HCA ? > Or is there anyother good way to do the same ? 
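The correction above (the response bit shares one byte with the method) can be sketched as a header layout. This is only an illustration under stated assumptions, not code from libibumad or OpenSM: the names mad_hdr, rmpp_hdr, rmpp_mad, mad_set_method, mad_method and mad_is_response are hypothetical, and the sketch assumes the usual 256-byte MAD with a 24-byte common header followed by a 12-byte RMPP header. Fields appear in wire order; byte-order conversion of the multi-byte fields is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAD_METHOD_GET        0x01  /* Get */
#define MAD_METHOD_GET_TABLE  0x12  /* SubnAdmGetTable */
#define MAD_METHOD_MASK       0x7f  /* low 7 bits: method code */
#define MAD_RESPONSE_BIT      0x80  /* high bit: R (response) flag */

/* Common MAD header: 24 bytes. There is no separate response_bit byte. */
struct mad_hdr {
    uint8_t  base_version;
    uint8_t  mgmt_class;
    uint8_t  class_version;
    uint8_t  method;          /* R bit (0x80) combined with method (0x7f) */
    uint16_t status;
    uint16_t class_specific;
    uint64_t tid;
    uint16_t attr_id;
    uint16_t resv;
    uint32_t attr_mod;
};

/* RMPP header: 12 bytes, directly after the common header. */
struct rmpp_hdr {
    uint8_t  rmpp_version;
    uint8_t  rmpp_type;
    uint8_t  rtime_flags;
    uint8_t  rmpp_status;
    uint32_t seg_num;
    uint32_t paylen_newwin;
};

/* Full 256-byte MAD: 36 bytes of headers plus 220 bytes of class data. */
struct rmpp_mad {
    struct mad_hdr  mad;
    struct rmpp_hdr rmpp;
    uint8_t         data[220];
};

/* Compile-time layout checks: a negative array size breaks the build. */
typedef char mad_hdr_is_24_bytes[sizeof(struct mad_hdr) == 24 ? 1 : -1];
typedef char rmpp_hdr_is_12_bytes[sizeof(struct rmpp_hdr) == 12 ? 1 : -1];
typedef char rmpp_mad_is_256_bytes[sizeof(struct rmpp_mad) == 256 ? 1 : -1];

/* Store a 7-bit method code and the R flag in the single method byte. */
static inline void mad_set_method(struct mad_hdr *h, uint8_t method,
                                  int is_response)
{
    assert((method & MAD_RESPONSE_BIT) == 0);
    h->method = (uint8_t)(method | (is_response ? MAD_RESPONSE_BIT : 0));
}

/* Extract the 7-bit method code. */
static inline uint8_t mad_method(const struct mad_hdr *h)
{
    return (uint8_t)(h->method & MAD_METHOD_MASK);
}

/* Non-zero when the R flag is set. */
static inline int mad_is_response(const struct mad_hdr *h)
{
    return (h->method & MAD_RESPONSE_BIT) != 0;
}
```

With this layout a Get request carries 0x01 in the method byte and its response carries 0x81, and the data area begins at offset 36, which matches the "256 - 36" payload length computed in the quoted code.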
Stepping back, what are you trying to do ? I think you answered part of this below. Can you elaborate more ? > Please throw some light on this. Do you have any userspace SA support for retrieving path, service record > information ? There have been discussions about userspace SA support but nothing currently for OpenIB (gen2). Currently, you can get this by using osm_vendor_ibumad_sa.c which supports most SA requests. It is built as part of libosmvendor (part of the OpenSM build) but can be used outside of OpenSM. It is used by osmtest if you want to look at some use cases. It obtains PathRecords and ServiceRecords. That might be an easier direction to go than trying to use the management libraries to build the pieces of a userspace SA client you want. -- Hal > > Regards. > - Takshak > > > Hal Rosenstock wrote: > > >Hi, > > > >There are a couple of issues with the below. > > > >1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. > > > >2. I will assume your register call sets RMPP. > > > >3. SA class version is 2. > > > >What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). > > > >-- Hal > > > >________________________________ > > > >From: openib-general-bounces at openib.org on behalf of Takshak C. > >Sent: Mon 2/6/2006 8:00 AM > >To: openib-general at openib.org > >Subject: [openib-general] Get Table Records for SA Attribute ID ? > > > > > > > >Hi, > > > >I m trying to get the table records for SA attribute ID in following way. > >But, I m not getting a single record, could anyone comment on the problem. > > > >1. 
I have created saMadFormat structure described in the specification as below: > > > >struct saMadFormat > >{ > > > > uint8_t base_version ; > > uint8_t mgmt_class ; > > uint8_t class_version ; > > uint8_t sa_method ; > > uint16_t status ; > > uint16_t not_used ; > > uint64_t tid ; > > uint16_t attr_id ; > > uint16_t resv ; > > uint32_t attr_mod ; > > uint64_t sa_key; > > uint64_t sm_key ; > > uint32_t seg_num ; > > uint32_t payload_len ; > > uint8_t frag_flag ; > > uint8_t edit_mod ; > > uint16_t window ; > > uint32_t endRID ; > > uint64_t comp_mask ; > > uint8_t adminData[192] ; > >}; > > > >2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS > > and umad_open_port etc successfully. > > > >3. struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); > > memset(saQuery, 0, sizeof(*saQuery)); > > > > saQuery->base_version = 1; > > saQuery->mgmt_class = IB_SA_CLASS ; > > saQuery->class_version = 1 ; > > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; > > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; > > saQuery->attr_mod = 0 ; > > saQuery->tid = htonll(drmad_tid++); > > saQuery->endRID = 0 ; > > > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); > > umad_set_grh(umad, 0); > > umad_set_pkey(umad, 0xFFFF); > > > >4. length = IB_MAD_SIZE; > > > > if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) > > IBPANIC("send failed"); > > > > if (umad_recv(portid, umad, &length, -1) != mad_agent) > > IBPANIC("recv error: %s", drmad_status_str(saQuery)); > > > > > > > > if (!dump_char) { > > xdump(stdout, 0, saQuery->adminData, 192); > > return 0; > > } > > > >I m expecting that, I will get the resultant data in saQuery->adminData. > >Is this correct ? If not then, how should I retrieve the table records ? > >Any Idea ? 
> > > > > >Thanks > >- Takshak > > From mst at mellanox.co.il Thu Feb 16 12:26:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Feb 2006 22:26:38 +0200 Subject: [openib-general] Re: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: References: <20060216192303.GA20997@mellanox.co.il> Message-ID: <20060216202638.GD20997@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params > > > Michael, > > > With this approach you'll get crashes when the module will get unloaded. > I haven't hit the crash yet when unloading the module. Could you explain it in detail? > > Do you mean, it might hit below problem? > unregister_netdevice: waiting for ib0 to become free. Usage count = 3 I think one of the patches might fix this. > > Further, we may get called with neighbour that is not related to ipoib at all. > The none ipoib related neighbour shouldn't have an ipoib destructor allocated. The destructor isn't per neighbour, unfortunately. That's what my kernel patches fix. > Thanks for the pointer, I will look at these patches soon. Thanks, more testing is always useful - mine was limited to intel/amd machines. You can try applying them all if you want - they are independent. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 16 12:29:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Feb 2006 22:29:29 +0200 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: Message-ID: <20060216202929.GE20997@mellanox.co.il> Quoting r. 
James Lentini : > I wouldn't want to see the repository split up. Is moving all of the > OpenIB code from svn to git an option? I don't think git supports multiple people working on the same repository. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Thu Feb 16 12:35:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Feb 2006 22:35:10 +0200 Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: References: <20060202185713.GA6219@mellanox.co.il> Message-ID: <20060216203510.GH20997@mellanox.co.il> Quoting r. Shirley Ma : > Subject: Re: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params > > > Hi, Michael, > > What's the current status of this patch? It will hopefully be in 2.6.17. For 2.6.16 and below we'll need the trick in ipoib_all_neigh_issues_2.patch or something like it. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Thu Feb 16 12:57:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Feb 2006 12:57:02 -0800 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060216202929.GE20997@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 16 Feb 2006 22:29:29 +0200") References: <20060216202929.GE20997@mellanox.co.il> Message-ID: Michael> I don't think git supports multiple people working on the Michael> same repository. It does, but it's not the normal workflow, so it's probably not that smooth at the moment. - R. From rdreier at cisco.com Thu Feb 16 12:58:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Feb 2006 12:58:42 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: (James Lentini's message of "Thu, 16 Feb 2006 15:21:00 -0500 (EST)") References: Message-ID: James> I wouldn't want to see the repository split up. Is moving James> all of the OpenIB code from svn to git an option? 
It is a possibility but it would be rather painful. But I would argue that having everything in the same repo is just introducing extra dependencies. - R. From rdreier at cisco.com Thu Feb 16 13:03:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Feb 2006 13:03:49 -0800 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060216194821.GB20997@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 16 Feb 2006 21:48:22 +0200") References: <20060216194821.GB20997@mellanox.co.il> Message-ID: Michael> One thing that might be important is to use the upcoming Michael> madvise(MADV_DONTFORK) on private CQ/QP work request ring Michael> buffers. This would require allocating these from Michael> page-sized pools. We should be able to do this in a compatible way, so I think this can go into the 1.0 branch. I'll add it to the README. libmthca already uses posix_memalign to make sure that CQ and QP buffers are page-aligned. >> I'm also thinking of moving my libibverbs and libmthca >> development trees to git (most likely hosted at kernel.org). >> This has the drawback of moving their development repositories >> out of the common openib.org svn tree. However, it will make >> handling 1.0, 1.1 and feature development branches much easier. >> I'd like to hear opinions on this before I make a decision. Michael> Well, since you ask, I am pretty happy with how things Michael> work with svn. Michael> Do you expect sufficient development on branches that Michael> needs advanced merging that git provides? I would think Michael> we just need an occasional bugfix there, which should be Michael> easy to handle by plain patches. It's really nice to be able to develop features on independent branches. And I really like things like git-bisect for debugging. The more I use git the more I think it's a lot better for my workflow on libibverbs. But maybe the hassle for all the consumers of libibverbs is enough that it's worth it to keep it in svn. 
On the other hand, things should be stabilizing enough that nearly everyone can work from tarballs or binary packages. - R. From jgunthorpe at obsidianresearch.com Thu Feb 16 13:15:19 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 16 Feb 2006 14:15:19 -0700 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060216202929.GE20997@mellanox.co.il> References: <20060216202929.GE20997@mellanox.co.il> Message-ID: <20060216211519.GD3901@obsidianresearch.com> On Thu, Feb 16, 2006 at 10:29:29PM +0200, Michael S. Tsirkin wrote: > Quoting r. James Lentini : > > I wouldn't want to see the repository split up. Is moving all of the > > OpenIB code from svn to git an option? > > I don't think git supports multiple people working on the same repository. It does, but I'm not sure if there is a non-ssh-based way to connect to the remote repository and send the changes - i.e. like the DAV-based mechanism svn uses. If you use ssh and the right group permissions then it does work (though I'm not 100% sure it has the proper locking yet). We've been doing that here on a very light basis with our local kernel patches. FWIW, having used BK commercially for years and git with the kernel, I really think distributed SCMs are the way to go. They are a lot easier on the developers, capture more information and let you have more interesting development flows. The CVS-like model that SVN implements is quite limiting in comparison. 
That said, git still has a lot of rough spots and I'm not sure how well things work outside the well-tested kernel workflow :> Jason From jgunthorpe at obsidianresearch.com Thu Feb 16 13:18:52 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 16 Feb 2006 14:18:52 -0700 Subject: [openib-general] Re: Low cost MBs compatible with Mellanox PCIE Cards In-Reply-To: <20060216195219.GC20997@mellanox.co.il> References: <000f01c63307$89a09cf0$9f01a8c0@BOBP> <064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com> <20060216193843.GC3901@obsidianresearch.com> <20060216195219.GC20997@mellanox.co.il> Message-ID: <20060216211852.GE3901@obsidianresearch.com> On Thu, Feb 16, 2006 at 09:52:19PM +0200, Michael S. Tsirkin wrote: > It could make sense to put this in openib Wiki together with any info > on patches needed to make it work. That sounds like a good idea. I have no wiki experience myself - can anyone add a new page? Any pointers? Thanks, Jason From ostampflee at terrasoftsolutions.com Thu Feb 16 13:27:47 2006 From: ostampflee at terrasoftsolutions.com (Owen Stampflee) Date: Thu, 16 Feb 2006 13:27:47 -0800 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140028908.4333.22226.camel@hal.voltaire.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> Message-ID: <1140125269.8783.13.camel@beast.terraplex.com> So, here is the backtrace with no code modifications... 

0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
#1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
#2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
#3  0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6
#4  0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6
#5  0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6
#6  0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6
#7  0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6
#8  0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6
#9  0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6
#10 0x00000080a362be90 in .cl_log_event () from /usr/lib64/libosmcomp.so.1
#11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1
#12 0x000000001001316c in ?? ()
#13 0x00000000100059b4 in ?? ()
#14 0x00000080b970411c in .generic_start_main () from /lib64/tls/libc.so.6
#15 0x00000080b97042a4 in .__libc_start_main () from /lib64/tls/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb)

Commenting out the cl_log_event in osm_log results in this backtrace:

(gdb) bt
#0  0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
#1  0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
#2  0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
#3  0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
#4  0x00000080b9758b50 in .__GI___libc_malloc () from /lib64/tls/libc.so.6
#5  0x00000400000607bc in __cl_malloc_priv (size=0) at cl_memory_osd.c:62
#6  0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
#7  0x00000400000629f4 in cl_ptr_vector_set_capacity (p_vector=0x100788d0, new_capacity=6349) at cl_ptr_vector.c:216
#8  0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16) at cl_ptr_vector.c:270
#9  0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, min_size=6349, grow_size=16) at cl_ptr_vector.c:93
#10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, thread_count=0, name=0x100464c0 "opensm") at cl_dispatcher.c:214
#11 0x00000000100133f8 in ?? ()
#12 0x00000000100059b4 in ?? ()
#13 0x00000080b970411c in .generic_start_main () from /lib64/tls/libc.so.6
#14 0x00000080b97042a4 in .__libc_start_main () from /lib64/tls/libc.so.6
#15 0x0000000000000000 in ?? ()

So now I've compiled it in 32-bit mode (had to fix my chroot) and
everything runs, but I get the following message...

Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0

Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90109764831) as the default port
Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port 0x2c90109764831.
Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port 0x2c90109764831.
Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping

Other info:
[root at m2 ~]# ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.3.2
        Hardware version: a1
        Node GUID: 0x0002c90109764830
        System image GUID: 0x0002c90109764833
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90109764831
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90109764832

[root at m2 ~]# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c901:0976:4831
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

Infiniband device 'mthca0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c901:0976:4832
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)

My archives suggest a firmware upgrade, but 3.3.3 isn't available from
SBS as far as I can tell and my contact no longer works there so I'm
going to have to find the new person to talk about getting newer
firmware, unless of course another vendor's firmware will work on this
card.

Cheers,
Owen

From robert.j.woodruff at intel.com  Thu Feb 16 13:36:35 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 16 Feb 2006 13:36:35 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
Message-ID: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>

Roland wrote,
>I'm also thinking of moving my libibverbs and libmthca development
>trees to git (most likely hosted at kernel.org). This has the
>drawback of moving their development repositories out of the common
>openib.org svn tree. However, it will make handling 1.0, 1.1 and
>feature development branches much easier. I'd like to hear opinions
>on this before I make a decision.

Having things split up, some at openib.org, some at kernel.org
would be a pain to deal with. I'd like to see all of the code
remain in one source code management tool. I have no problems with SVN.
Not sure if git provides any better features or not.

woody

From viswa.krish at gmail.com  Thu Feb 16 13:58:34 2006
From: viswa.krish at gmail.com (Viswanath Krishnamurthy)
Date: Thu, 16 Feb 2006 13:58:34 -0800
Subject: [openib-general] Getting the right userspace libraries
Message-ID: <4df28be40602161358h65d32dber371447839b23248f@mail.gmail.com>

How does one pull out the correct userland libraries for 2.6.16 kernel
IB stack. Is it to look at the SVN number in the driver code, and pull
that version ?

-Viswa

From rdreier at cisco.com  Thu Feb 16 14:07:45 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 14:07:45 -0800
Subject: [openib-general] Getting the right userspace libraries
In-Reply-To: <4df28be40602161358h65d32dber371447839b23248f@mail.gmail.com> (Viswanath Krishnamurthy's message of "Thu, 16 Feb 2006 13:58:34 -0800")
References: <4df28be40602161358h65d32dber371447839b23248f@mail.gmail.com>
Message-ID:

    Viswanath> How does one pull out the correct userland libraries
    Viswanath> for 2.6.16 kernel IB stack. Is it to look at the SVN
    Viswanath> number in the driver code, and pull that version ?

2.6.16 is not out yet. However, there will be no major changes in the
interface with userspace. So any recent userspace libraries will
work. In general the head of the svn repository is probably your best
bet.

 - R.

From rdreier at cisco.com  Thu Feb 16 14:09:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 14:09:28 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com> (Bob Woodruff's message of "Thu, 16 Feb 2006 13:36:35 -0800")
References: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>
Message-ID:

    Bob> Having things split up, some at openib.org, some at
    Bob> kernel.org would be a pain to deal with. I'd like to see all
    Bob> of the code remain in one source code management tool. I have
    Bob> no problems with SVN. Not sure if git provides any better
    Bob> features or not.

Can you be more explicit about the pain? What does it make worse?

 - R.

From rjwalsh at pathscale.com  Thu Feb 16 14:23:04 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Thu, 16 Feb 2006 14:23:04 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
References:
Message-ID: <1140128584.25199.3.camel@hematite.internal.keyresearch.com>

> * Eliminate the dependency on libsysfs by implementing the required
> sysfs handling directly.

Any particular reason why, other than just minimizing external
dependencies?

Regards,
 Robert.

From robert.j.woodruff at intel.com  Thu Feb 16 14:24:32 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 16 Feb 2006 14:24:32 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
Message-ID: <000101c63347$c25ff140$b5a0070a@amr.corp.intel.com>

Roland wrote,
>Can you be more explicit about the pain? What does it make worse?

>- R.

Today when I want to get the tip, I simply do a SVN update of my tree
and everything that has changed gets updated.
I can also subscribe to the commits email list to know if something
changes. If some components are now in a git tree, I would need to
first install and learn git, then pull some components from git from
kernel.org, some components from SVN and hope they work together.

And if some code gets moved to another site, like kernel.org, is that
development really still covered by the openib licensing and promoters
agreements ?

woody

From rdreier at cisco.com  Thu Feb 16 14:25:32 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 14:25:32 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <1140128584.25199.3.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Thu, 16 Feb 2006 14:23:04 -0800")
References: <1140128584.25199.3.camel@hematite.internal.keyresearch.com>
Message-ID:

    Roland> * Eliminate the dependency on libsysfs by implementing the
    Roland> required sysfs handling directly.

    Robert> Any particular reason why, other than just minimizing
    Robert> external dependencies?

libsysfs is unmaintained, and the main user of it (udev) has stopped
using it. So it's going to bitrot and get dropped from distros. And
libsysfs kind of sucks too.

 - R.

From rjwalsh at pathscale.com  Thu Feb 16 14:26:18 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Thu, 16 Feb 2006 14:26:18 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
References: <1140128584.25199.3.camel@hematite.internal.keyresearch.com>
Message-ID: <1140128778.25199.5.camel@hematite.internal.keyresearch.com>

> libsysfs is unmaintained, and the main user of it (udev) has stopped
> using it. So it's going to bitrot and get dropped from distros. And
> libsysfs kind of sucks too.

Fair enough - that's a good reason.

Regards,
 Robert.

From bos at pathscale.com  Thu Feb 16 14:28:58 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Thu, 16 Feb 2006 14:28:58 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
References: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>
Message-ID: <1140128938.12954.23.camel@serpentine.pathscale.com>

On Thu, 2006-02-16 at 14:09 -0800, Roland Dreier wrote:

> Can you be more explicit about the pain? What does it make worse?

It means that someone is going to have to integrate changes from your
git tree into the SVN repo. Is this something you're willing to do, for
example?

Taking off my release manager hat and putting on my "developer speaking
for himself" flame-retardant headgear, I would be second in line after
Roland to put a stake in the heart of the SVN repository and replace it
with a sensible DSCM. If we were to do that, it would not matter where
individual repositories got hosted, provided the canonical repositories
were available somewhere convenient and public.

However, let me switch hats again. If we want to get a 1.0 release out
the door in a reasonable time frame, this is not a good time to be
having a barney over the choice of SCM to use. For example, I'd be
almost as unhappy with git as I am with SVN, due to its user interface
and resource hogginess.

Unless a lot of other people are all excited about the idea of ditching
SVN, I think we should hold our noses and stick with it for 1.0. We can
start a conversation about moving post-1.0 development to a DSCM (and
perhaps even make the move itself) in parallel, while we go through the
1.0 release motions.

References: <1140128584.25199.3.camel@hematite.internal.keyresearch.com>
Message-ID: <1140129043.12954.26.camel@serpentine.pathscale.com>

On Thu, 2006-02-16 at 14:23 -0800, Robert Walsh wrote:

> Any particular reason why, other than just minimizing external
> dependencies?

libsysfs has been orphaned, and due to its design is way harder to use
than just banging on sysfs itself directly.

Roland wrote,
>Can you be more explicit about the pain? What does it make worse?

> - R.

Here is a thought. Since we have been talking about a 1.0 release
anyway, what if when we do the "branch" for 1.0, we really just pull
a version that would be used to start the trunk of the 2.0 development
tree. At the same time, that would be a good time to switch source code
management tools, if that is what people want to do. The new trunk, for
2.0, would start in the new tree and the SVN tree could become the 1.0
release tree and any further patches would be done just to fix bugs for
the release. New features would go into the new tree in perhaps a new
SCM tool.

Thoughts ?

woody

From rdreier at cisco.com  Thu Feb 16 14:40:48 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 14:40:48 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <1140128938.12954.23.camel@serpentine.pathscale.com> (Bryan O'Sullivan's message of "Thu, 16 Feb 2006 14:28:58 -0800")
References: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>
	<1140128938.12954.23.camel@serpentine.pathscale.com>
Message-ID:

    Bryan> It means that someone is going to have to integrate changes
    Bryan> from your git tree into the SVN repo. Is this something
    Bryan> you're willing to do, for example?

I don't see that as a requirement at all. If you want to look at my
development tree, you pull the git tree. That gives you a lot more
information than subversion: you get to see all of my branches and the
merges between them, rather than the artificially linearized svn
history.

If you want a stable release, then take the 1.0, 1.0.1, etc tarballs
that I will release. I'm also happy to produce daily snapshots of
whatever branches people are interested in tracking.

    Bryan> However, let me switch hats again. If we want to get a 1.0
    Bryan> release out the door in a reasonable time frame, this is
    Bryan> not a good time to be having a barney over the choice of
    Bryan> SCM to use. For example, I'd be almost as unhappy with git
    Bryan> as I am with SVN, due to its user interface and resource
    Bryan> hogginess.

I think you're misunderstanding what I was proposing. I was not
suggesting that svn at openib.org be turned off or anything like that,
simply that one component move to a SCM that works better for me (the
libibverbs maintainer). Right now the monolithic svn repo confuses
things and makes it seem like there are dependencies between components
when there actually aren't.

We'll never reach consensus on a new SCM, so I think it's better if
each component maintainer can decide what works best.

    Bryan> Unless a lot of other people are all excited about the idea
    Bryan> of ditching SVN, I think we should hold our noses and stick
    Bryan> with it for 1.0. We can start a conversation about moving
    Bryan> post-1.0 development to a DSCM (and perhaps even make the
    Bryan> move itself) in parallel, while we go through the 1.0
    Bryan> release motions.

I definitely wouldn't switch repositories until libibverbs 1.0 is out.

Anyway, it seems that the users of the repository really want
everything to stay the way it is, so I guess I'll drop this plan for
now. Sigh...

 - R.

From rdreier at cisco.com  Thu Feb 16 14:43:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 14:43:39 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0006EAE63E@orsmsx408> (Robert J. Woodruff's message of "Thu, 16 Feb 2006 14:35:48 -0800")
References: <1AC79F16F5C5284499BB9591B33D6F0006EAE63E@orsmsx408>
Message-ID:

    Robert> The new trunk, for 2.0, would start in the new tree and
    Robert> the SVN tree could become the 1.0 release tree and any
    Robert> further patches would be done just to fix bugs for the
    Robert> release. New features would go into the new tree in
    Robert> perhaps a new SCM tool.

I don't think it's worth making that move if we're going to leave
everything in one repository. Right now, we have components like
libibverbs, opensm, mvapich, udapl, etc, etc all in the same
repository, which is just silly. Why should opensm and libibverbs be
tied together?

 - R.

From robert.j.woodruff at intel.com  Thu Feb 16 15:19:21 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 16 Feb 2006 15:19:21 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
Message-ID: <000201c6334f$6b307a90$b5a0070a@amr.corp.intel.com>

Roland wrote,
>I don't think it's worth making that move if we're going to leave
>everything in one repository. Right now, we have components like
>libibverbs, opensm, mvapich, udapl, etc, etc all in the same
>repository, which is just silly. Why should opensm and libibverbs be
>tied together?

> - R.

Having them in the same database makes life easier for those that need
to manage releases or those that just want to get the latest version of
all the code.

Having them packaged in the same release makes life easier for those
that want to install and use openib. I have put together just 2 RPMs
(one kernel-mode) and (one user-mode) for my internal customers and
they still seem to have problems getting things installed and I have
to spend time helping them along. Having various components separated
into independent releases makes it a lot harder for people to know what
version works with what, and I am sure they will get a mismatch and
then call me to help them debug it when it does not work.

woody

From halr at voltaire.com  Thu Feb 16 15:18:05 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 Feb 2006 18:18:05 -0500
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140125269.8783.13.camel@beast.terraplex.com>
References: <1139945214.2169.4.camel@beast.terraplex.com>
	<1139963140.4333.16065.camel@hal.voltaire.com>
	<1139964683.5941.10.camel@beast.terraplex.com>
	<1140004573.4333.19957.camel@hal.voltaire.com>
	<1140021789.21679.2.camel@beast.terraplex.com>
	<1140023843.4333.21715.camel@hal.voltaire.com>
	<1140025284.22080.2.camel@beast.terraplex.com>
	<20060215184711.GE12172@sashak.voltaire.com>
	<1140028908.4333.22226.camel@hal.voltaire.com>
	<1140125269.8783.13.camel@beast.terraplex.com>
Message-ID: <1140131863.4333.34690.camel@hal.voltaire.com>

Hi Owen,

On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> So, here is the back trace with no code modifications...
>
> 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> #3 0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6
> #4 0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6
> #5 0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6
> #6 0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6
> #7 0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6
> #8 0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6
> #9 0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6
> #10 0x00000080a362be90 in .cl_log_event ()
>    from /usr/lib64/libosmcomp.so.1
> #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1
> #12 0x000000001001316c in ?? ()
> #13 0x00000000100059b4 in ?? ()
> #14 0x00000080b970411c in .generic_start_main ()
>    from /lib64/tls/libc.so.6
> #15 0x00000080b97042a4 in .__libc_start_main ()
>    from /lib64/tls/libc.so.6
> #16 0x0000000000000000 in ?? ()
> (gdb)
>
> Commenting out the cl_log_event in osm_log results in this backtrace:
>
> (gdb) bt
> #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
> #4 0x00000080b9758b50 in .__GI___libc_malloc ()
>    from /lib64/tls/libc.so.6
> #5 0x00000400000607bc in __cl_malloc_priv (size=0) at
> cl_memory_osd.c:62
> #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
> #7 0x00000400000629f4 in cl_ptr_vector_set_capacity
> (p_vector=0x100788d0,
>    new_capacity=6349) at cl_ptr_vector.c:216
> #8 0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16)
>    at cl_ptr_vector.c:270
> #9 0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0,
> min_size=6349,
>    grow_size=16) at cl_ptr_vector.c:93
> #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0,
> thread_count=0,
>    name=0x100464c0 "opensm") at cl_dispatcher.c:214
> #11 0x00000000100133f8 in ?? ()
> #12 0x00000000100059b4 in ?? ()
> #13 0x00000080b970411c in .generic_start_main ()
>    from /lib64/tls/libc.so.6
> #14 0x00000080b97042a4 in .__libc_start_main ()
>    from /lib64/tls/libc.so.6
> #15 0x0000000000000000 in ?? ()

__cl_malloc_priv is just a wrapper for malloc:

from cl_memory_osd.c:
void*
__cl_malloc_priv(
        IN const size_t size )
{
        return malloc( size );
}

If I believe gdb this appears to be a malloc of 0 bytes but since the
new_capacity was 6349 (and this would be multiplied by sizeof(void *)),
I'm not sure whether to trust this.

Can you send me the compile line from the OpenSM build ? Are the
include paths correct for 64 bit headers ?

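As a rough illustration of the arithmetic in question, here is a minimal sketch. The names example_malloc_priv, example_ptr_vector_bytes and example_alloc_ptr_array are invented for this note and are not complib functions; the only thing taken from the thread is that the capacity is multiplied by sizeof(void *) before reaching a thin malloc wrapper. Under that assumption a capacity of 6349 should request 6349 * sizeof(void *) bytes (50792 with 8-byte pointers, 25396 with 4-byte pointers), never 0:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the wrapper quoted above (__cl_malloc_priv):
 * it simply forwards the size to malloc(). */
void *example_malloc_priv(size_t size)
{
    return malloc(size);
}

/* Bytes needed to hold `capacity` pointers, or 0 if the multiplication
 * would overflow size_t.  Assumption: this mirrors the
 * capacity * sizeof(void *) step described in the message; it is not
 * copied from cl_ptr_vector.c. */
size_t example_ptr_vector_bytes(size_t capacity)
{
    if (capacity > SIZE_MAX / sizeof(void *))
        return 0;
    return capacity * sizeof(void *);
}

/* Allocate zeroed storage for `capacity` pointers.  Returns NULL on
 * overflow, on a zero-sized request, or if malloc() fails. */
void **example_alloc_ptr_array(size_t capacity)
{
    size_t bytes = example_ptr_vector_bytes(capacity);
    void **array;

    if (bytes == 0)
        return NULL;
    array = example_malloc_priv(bytes);
    if (array != NULL)
        memset(array, 0, bytes);
    return array;
}
```

If a debugger reports size=0 at the wrapper while the caller's capacity is 6349, either the reported argument is unreliable or caller and callee disagree about type sizes, which is why the compile line and include paths are worth checking.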
> So now I've compiled it in 32-bit mode (had to fix my chroot) and
> everything runs, but I get the following message...
>
> Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0
>
> Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90109764831) as the default port
> Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port
> 0x2c90109764831.
> Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port
> 0x2c90109764831.
> Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to
> obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping

For some reason, on the response received, it is not finding the match
in the transaction table. I thought this was fixed a while ago for
PowerPC. Can you run opensm with -V and see if there is any more output
that might be helpful ?

> Other info:
> [root at m2 ~]# ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.3.2
>         Hardware version: a1
>         Node GUID: 0x0002c90109764830
>         System image GUID: 0x0002c90109764833
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00510a68
>                 Port GUID: 0x0002c90109764831
>         Port 2:
>                 State: Down
>                 Physical state: Polling
>                 Rate: 2
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00510a68
>                 Port GUID: 0x0002c90109764832
>
>
> [root at m2 ~]# ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:0976:4831
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            10 Gb/sec (4X)

This is goodness and means the physical link has been established on
this port.

> Infiniband device 'mthca0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0002:c901:0976:4832
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      2: Polling
>         rate:            2.5 Gb/sec (1X)
>
>
> My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> SBS as far as I can tell and my contact no longer works there so I'm
> going to have to find the new person to talk about getting newer
> firmware, unless of course another vendor's firmware will work on this
> card.

I think 3.3.2 should be OK. In any case, I doubt it's the source of the
problem above.

-- Hal

> Cheers,
> Owen
>

From mst at mellanox.co.il  Thu Feb 16 15:38:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 17 Feb 2006 01:38:49 +0200
Subject: [openib-general] Re: Low cost MBs compatible with Mellanox PCIE Cards
In-Reply-To: <20060216211852.GE3901@obsidianresearch.com>
References: <000f01c63307$89a09cf0$9f01a8c0@BOBP>
	<064536929FFE43A49A0CEFD946BF9228.MAI@windows.securehostserver.com>
	<20060216193843.GC3901@obsidianresearch.com>
	<20060216195219.GC20997@mellanox.co.il>
	<20060216211852.GE3901@obsidianresearch.com>
Message-ID: <20060216233849.GI20997@mellanox.co.il>

Quoting r. Jason Gunthorpe :
> Subject: Re: Low cost MBs compatible with Mellanox PCIE Cards
>
> On Thu, Feb 16, 2006 at 09:52:19PM +0200, Michael S. Tsirkin wrote:
>
> > It could make sense to put this in openib Wiki together with any info
> > on patches needed to make it work.
>
> That sounds like a good idea, I have no wiki experience myself - can
> anyone add a new page? Any pointers?

Start here
https://openib.org/tiki/tiki-index.php
log in and click edit. You'll see wiki help on the right.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il  Thu Feb 16 15:41:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 17 Feb 2006 01:41:04 +0200
Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
References: <1140128584.25199.3.camel@hematite.internal.keyresearch.com>
Message-ID: <20060216234104.GJ20997@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: Plans for libibverbs 1.0, 1.1 and beyond
>
> Roland> * Eliminate the dependency on libsysfs by implementing the
> Roland> required sysfs handling directly.
>
> Robert> Any particular reason why, other than just minimizing
> Robert> external dependencies?
>
> libsysfs is unmaintained, and the main user of it (udev) has stopped
> using it. So it's going to bitrot and get dropped from distros. And
> libsysfs kind of sucks too.

Right. And the stuff we use is pretty trivial too.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From rdreier at cisco.com  Thu Feb 16 15:39:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 15:39:58 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <000201c6334f$6b307a90$b5a0070a@amr.corp.intel.com> (Bob Woodruff's message of "Thu, 16 Feb 2006 15:19:21 -0800")
References: <000201c6334f$6b307a90$b5a0070a@amr.corp.intel.com>
Message-ID:

    Bob> Having them packaged in the same release makes life easier
    Bob> for those that want to install and use openib. I have put
    Bob> together just 2 RPMs (one kernel-mode) and (one user-mode)
    Bob> for my internal customers and they still seem to have
    Bob> problems getting things installed and I have to spend time
    Bob> helping them along.

This is probably OK in the short term but eventually you will go crazy
when a bug fix in libmthca forces you to rebuild opensm, udapl, etc,
etc.

X.org faced exactly the same issue and they spent a huge amount of
time converting their tree from a monolith to a modular build. We
should learn from their suffering and not create a monolith in the
first place.

 - R.

From mst at mellanox.co.il  Thu Feb 16 15:44:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 17 Feb 2006 01:44:35 +0200
Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
References: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>
	<1140128938.12954.23.camel@serpentine.pathscale.com>
Message-ID: <20060216234435.GK20997@mellanox.co.il>

Quoting r. Roland Dreier :
> Anyway, it seems that the users of the repository really want
> everything to stay the way it is, so I guess I'll drop this plan for
> now. Sigh...

It could just be inertia. I never used git yet - quilt seems to be
sufficient for stuff I do. How do you use it? Directly or with cogito?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From rdreier at cisco.com  Thu Feb 16 15:45:02 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 16 Feb 2006 15:45:02 -0800
Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To: <20060216234435.GK20997@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 17 Feb 2006 01:44:35 +0200")
References: <000001c63341$0fb1d000$b5a0070a@amr.corp.intel.com>
	<1140128938.12954.23.camel@serpentine.pathscale.com>
	<20060216234435.GK20997@mellanox.co.il>
Message-ID:

    Michael> It could just be inertia. I never used git yet - quilt
    Michael> seems to be sufficient for stuff I do. How do you use
    Michael> it? Directly or with cogito?

I mostly use git directly, although I've been using StGit a little
lately. StGit is a very nice quilt replacement that integrates the
quilt workflow with tracking upstream git trees.

 - R.

From robert.j.woodruff at intel.com  Thu Feb 16 16:09:52 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 16 Feb 2006 16:09:52 -0800
Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond
In-Reply-To:
Message-ID: <000301c63356$79d739b0$b5a0070a@amr.corp.intel.com>

Roland wrote>
>This is probably OK in the short term but eventually you will go crazy
>when a bug fix in libmthca forces you to rebuild opensm, udapl, etc, etc.

Agreed. Once stuff is shipped and installed by the distros, it may be
easier to have different components mature at different rates.

I still however think that openIB will want to periodically test a set
of components together and call that set of components (RPMs or tar
balls) a release. 1.0, 2.0, etc. At least people would then know that
a specific set of components RPMs have all been tested and work
together.
my 2 cents, woody From rdreier at cisco.com Thu Feb 16 16:15:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Feb 2006 16:15:00 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <000301c63356$79d739b0$b5a0070a@amr.corp.intel.com> (Bob Woodruff's message of "Thu, 16 Feb 2006 16:09:52 -0800") References: <000301c63356$79d739b0$b5a0070a@amr.corp.intel.com> Message-ID: Bob> I still however think that openIB will want to periodically Bob> test a set of components together and call that set of Bob> components (RPMs or tar balls) a release. 1.0, 2.0, etc. At Bob> least people would then know that a specific set of Bob> components RPMs have all been tested and work together. Yes, I think that's a good idea. But that doesn't require that all the components are checked into the same repository. - R. From robert.j.woodruff at intel.com Thu Feb 16 16:19:50 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 16 Feb 2006 16:19:50 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: Message-ID: <000401c63357$de284430$b5a0070a@amr.corp.intel.com> Roland wrote> Bob> I still however think that openIB will want to periodically Bob> test a set of components together and call that set of Bob> components (RPMs or tar balls) a release. 1.0, 2.0, etc. At Bob> least people would then know that a specific set of Bob> components RPMs have all been tested and work together. >Yes, I think that's a good idea. But that doesn't require that all >the components are checked into the same repository. >- R. No it does not require they are in the same repository, but it is easier (at least for me) to only have to deal with one source code control system. 
woody From ostampflee at terrasoftsolutions.com Thu Feb 16 17:01:10 2006 From: ostampflee at terrasoftsolutions.com (Owen Stampflee) Date: Thu, 16 Feb 2006 17:01:10 -0800 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140131863.4333.34690.camel@hal.voltaire.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com> Message-ID: <1140138070.9680.7.camel@beast.terraplex.com> http://cvs.terraplex.com/~owen/osm.log I'm currently using the Fedora FC5 packages that have been rebuilt, I'm going to try the OpenIB 5411 source. 
From sashak at voltaire.com Thu Feb 16 17:41:10 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 17 Feb 2006 03:41:10 +0200
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140125269.8783.13.camel@beast.terraplex.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com>
Message-ID: <20060217014110.GN12172@sashak.voltaire.com>

On 13:27 Thu 16 Feb , Owen Stampflee wrote:
>
> Commenting out the cl_log_event in osm_log results in this backtrace:
>
> (gdb) bt
> #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6
> #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6
> #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6
> #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6
> #4 0x00000080b9758b50 in .__GI___libc_malloc ()
> from /lib64/tls/libc.so.6
> #5 0x00000400000607bc in __cl_malloc_priv (size=0) at
> cl_memory_osd.c:62
> #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416
> #7 0x00000400000629f4 in cl_ptr_vector_set_capacity
> (p_vector=0x100788d0,
> new_capacity=6349) at cl_ptr_vector.c:216

p_vector=0x100788d0 does not look like a valid 64-bit address. Could it be that complib was not built properly (I don't know how exactly; perhaps in the way it was ./configure'd)?
Sasha

From ostampflee at terrasoftsolutions.com Thu Feb 16 17:43:16 2006
From: ostampflee at terrasoftsolutions.com (Owen Stampflee)
Date: Thu, 16 Feb 2006 17:43:16 -0800
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140131863.4333.34690.camel@hal.voltaire.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com>
Message-ID: <1140140596.9680.13.camel@beast.terraplex.com>

A 32-bit build of 5411 gets the link to become active and ibv_rc_pingpong works, but I can't bring up IPoIB...

dmesg says this (tried both ib0 and ib1 to ensure ports weren't swapped):
ADDRCONF(NETDEV_UP): ib0: link is not ready
ADDRCONF(NETDEV_UP): ib1: link is not ready

At least we're making progress.

Thanks,
Owen

On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote:
> Hi Owen,
>
> On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> > So, here is the back trace with no code modifications...
> > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > (gdb) bt > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > #3 0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6 > > #4 0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6 > > #5 0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6 > > #6 0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6 > > #7 0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6 > > #8 0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6 > > #9 0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6 > > #10 0x00000080a362be90 in .cl_log_event () > > from /usr/lib64/libosmcomp.so.1 > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1 > > #12 0x000000001001316c in ?? () > > #13 0x00000000100059b4 in ?? () > > #14 0x00000080b970411c in .generic_start_main () > > from /lib64/tls/libc.so.6 > > #15 0x00000080b97042a4 in .__libc_start_main () > > from /lib64/tls/libc.so.6 > > #16 0x0000000000000000 in ?? 
() > > (gdb) > > > > Commenting out the cl_log_event in osm_log results in this backtrace: > > > > (gdb) bt > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6 > > #4 0x00000080b9758b50 in .__GI___libc_malloc () > > from /lib64/tls/libc.so.6 > > #5 0x00000400000607bc in __cl_malloc_priv (size=0) at > > cl_memory_osd.c:62 > > #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416 > > #7 0x00000400000629f4 in cl_ptr_vector_set_capacity > > (p_vector=0x100788d0, > > new_capacity=6349) at cl_ptr_vector.c:216 > > #8 0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16) > > at cl_ptr_vector.c:270 > > #9 0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, > > min_size=6349, > > grow_size=16) at cl_ptr_vector.c:93 > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, > > thread_count=0, > > name=0x100464c0 "opensm") at cl_dispatcher.c:214 > > #11 0x00000000100133f8 in ?? () > > #12 0x00000000100059b4 in ?? () > > #13 0x00000080b970411c in .generic_start_main () > > from /lib64/tls/libc.so.6 > > #14 0x00000080b97042a4 in .__libc_start_main () > > from /lib64/tls/libc.so.6 > > #15 0x0000000000000000 in ?? () > > __cl_malloc_priv is just a wrapper for malloc: > > from cl_memory_osd.c: > void* > __cl_malloc_priv( > IN const size_t size ) > { > return malloc( size ); > } > > If I believe gdb this appears to be a malloc of 0 bytes but since the > new_capacity was 6349 (and this would be multiplied by sizeof(void *)), > I'm not sure whether to trust this. > > Can you send me the compile line from the OpenSM build ? Are the include > paths correct for 64 bit headers ? > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and > > everything runs, but I get the following message... 
> > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0 > > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting > > Generic Notice type:3 num:66 from LID:0x0000 > > GID:0xfe80000000000000,0x0000000000000000 > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting > > Generic Notice type:3 num:66 from LID:0x0000 > > GID:0xfe80000000000000,0x0000000000000000 > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr: > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port > > 0x2c90109764831. > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port > > 0x2c90109764831. > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping > > For some reason, on the response received, it is not finding the match > in the transaction table. I thought this was fixed a while ago for > PowerPC. Can you run opensm with -V and see if there is any more output > that might be helpful ? 
>
> > Other info:
> > [root@m2 ~]# ibstat
> > CA 'mthca0'
> > CA type: MT23108
> > Number of ports: 2
> > Firmware version: 3.3.2
> > Hardware version: a1
> > Node GUID: 0x0002c90109764830
> > System image GUID: 0x0002c90109764833
> > Port 1:
> > State: Initializing
> > Physical state: LinkUp
> > Rate: 10
> > Base lid: 0
> > LMC: 0
> > SM lid: 0
> > Capability mask: 0x00510a68
> > Port GUID: 0x0002c90109764831
> > Port 2:
> > State: Down
> > Physical state: Polling
> > Rate: 2
> > Base lid: 0
> > LMC: 0
> > SM lid: 0
> > Capability mask: 0x00510a68
> > Port GUID: 0x0002c90109764832
> >
> >
> > [root@m2 ~]# ibstatus
> > Infiniband device 'mthca0' port 1 status:
> > default gid: fe80:0000:0000:0000:0002:c901:0976:4831
> > base lid: 0x0
> > sm lid: 0x0
> > state: 2: INIT
> > phys state: 5: LinkUp
> > rate: 10 Gb/sec (4X)
>
> This is goodness and means the physical link has been established on
> this port.
>
> > Infiniband device 'mthca0' port 2 status:
> > default gid: fe80:0000:0000:0000:0002:c901:0976:4832
> > base lid: 0x0
> > sm lid: 0x0
> > state: 1: DOWN
> > phys state: 2: Polling
> > rate: 2.5 Gb/sec (1X)
> >
> >
> > My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> > SBS as far as I can tell and my contact no longer works there so I'm
> > going to have to find the new person to talk about getting newer
> > firmware, unless of course another vendor's firmware will work on this
> > card.
>
> I think 3.3.2 should be OK. In any case, I doubt it's the source of the
> problem above.
>
> -- Hal
>
> > Cheers,
> > Owen
> >
> >
From halr at voltaire.com Thu Feb 16 17:43:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 Feb 2006 20:43:47 -0500
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140138070.9680.7.camel@beast.terraplex.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com> <1140138070.9680.7.camel@beast.terraplex.com>
Message-ID: <1140140623.4333.36207.camel@hal.voltaire.com>

Hi again Owen,

On Thu, 2006-02-16 at 20:01, Owen Stampflee wrote:
> http://cvs.terraplex.com/~owen/osm.log
>
> I'm currently using the Fedora FC5 packages that have been rebuilt,

I'm not sure which svn revision the FC5 package corresponds to.
Can you check the following in osm/libvendor/osm_vendor_ibumad.c:

Index: libvendor/osm_vendor_ibumad.c
===================================================================
--- libvendor/osm_vendor_ibumad.c (revision 5029)
+++ libvendor/osm_vendor_ibumad.c (revision 5030)
@@ -137,7 +137,7 @@ get_madw(osm_vendor_t *p_vend, ib_net64_t *tid)
 {
 	umad_match_t *m, *e;
-	ib_net64_t mtid = (*tid & 0xffffffff00000000llu);
+	ib_net64_t mtid = (*tid & cl_ntoh64(0x00000000ffffffffllu));

 	cl_spinlock_acquire( &p_vend->match_tbl_lock );
 	for (m = p_vend->mtbl.tbl, e = m + p_vend->mtbl.max; m < e; m++) {

This was fixed in svn 5030:

r5030 | halr | 2006-01-16 13:05:22 -0500 (Mon, 16 Jan 2006) | 8 lines

In osm_vendor_ibumad.c::get_madw, fix endian of mask for
transaction ID used for comparison

The lack of this caused issues when running OpenSM on PPC, where
get_madw would fail to find the matching transaction

Signed-off-by: Hal Rosenstock

I am guessing that the FC5 OpenSM package is earlier than 5030. Can you check get_madw (and make the one-line change if it does not handle the tid properly)?

> I'm going to try the OpenIB 5411 source.

That will be another way to tell this.
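
[Editor's illustration] The r5030 change can be shown in isolation. The transaction ID is held in network (big-endian) byte order, so a mask written as a host-order constant picks different bytes on little-endian and big-endian hosts. The sketch below is not OpenSM code: `hton64`, `match_tid` and `tid_mask_is_portable` are hypothetical names, with `hton64` standing in for complib's `cl_ntoh64`, and it assumes the low 32 bits of the TID are the part used for matching, as the corrected mask suggests.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a 64-bit value from host order to network
 * (big-endian) order by writing its bytes most-significant first. */
uint64_t hton64(uint64_t host)
{
    unsigned char bytes[8];
    uint64_t net;
    int i;

    for (i = 0; i < 8; i++)
        bytes[i] = (unsigned char)(host >> (56 - 8 * i));
    memcpy(&net, bytes, sizeof(net));
    return net;
}

/* Hypothetical helper: keep only the low 32 bits of a TID that is stored
 * in network byte order. The mask is converted to network order first, so
 * the same wire bytes are selected on little- and big-endian hosts. */
uint64_t match_tid(uint64_t net_tid)
{
    return net_tid & hton64(0x00000000ffffffffULL);
}

/* Self-check that holds regardless of host endianness. */
int tid_mask_is_portable(void)
{
    uint64_t net_tid = hton64(0x1122334455667788ULL);

    return match_tid(net_tid) == hton64(0x0000000055667788ULL);
}
```

On a little-endian host the converted mask has the same value as the old constant 0xffffffff00000000, which would explain why the original code only misbehaved on big-endian machines such as PPC.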
-- Hal

From halr at voltaire.com Thu Feb 16 20:45:21 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 16 Feb 2006 23:45:21 -0500
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140140596.9680.13.camel@beast.terraplex.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com> <1140140596.9680.13.camel@beast.terraplex.com>
Message-ID: <1140151518.4333.37979.camel@hal.voltaire.com>

On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote:
> A 32-bit build of 5411 gets the link to become active

Glad to hear this. That is what I would expect, and I would like to confirm that the tid patch is missing from the FC5 package, as well as get to the bottom of the 64-bit issues, if you have some time to help with this.

-- Hal

> and ibv_rc_pingpong works, but I can't bring up IPoIB...
>
> dmesg says this (tried both ib0 and ib1 to ensure ports weren't swapped):
> ADDRCONF(NETDEV_UP): ib0: link is not ready
> ADDRCONF(NETDEV_UP): ib1: link is not ready
>
> At least we're making progress.
>
> Thanks,
> Owen
>
> On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote:
> > Hi Owen,
> >
> > On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote:
> > > So, here is the back trace with no code modifications...
> > > > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > (gdb) bt > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > #3 0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6 > > > #4 0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6 > > > #5 0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6 > > > #6 0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6 > > > #7 0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6 > > > #8 0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6 > > > #9 0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6 > > > #10 0x00000080a362be90 in .cl_log_event () > > > from /usr/lib64/libosmcomp.so.1 > > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1 > > > #12 0x000000001001316c in ?? () > > > #13 0x00000000100059b4 in ?? () > > > #14 0x00000080b970411c in .generic_start_main () > > > from /lib64/tls/libc.so.6 > > > #15 0x00000080b97042a4 in .__libc_start_main () > > > from /lib64/tls/libc.so.6 > > > #16 0x0000000000000000 in ?? 
() > > > (gdb) > > > > > > Commenting out the cl_log_event in osm_log results in this backtrace: > > > > > > (gdb) bt > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6 > > > #4 0x00000080b9758b50 in .__GI___libc_malloc () > > > from /lib64/tls/libc.so.6 > > > #5 0x00000400000607bc in __cl_malloc_priv (size=0) at > > > cl_memory_osd.c:62 > > > #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416 > > > #7 0x00000400000629f4 in cl_ptr_vector_set_capacity > > > (p_vector=0x100788d0, > > > new_capacity=6349) at cl_ptr_vector.c:216 > > > #8 0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16) > > > at cl_ptr_vector.c:270 > > > #9 0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, > > > min_size=6349, > > > grow_size=16) at cl_ptr_vector.c:93 > > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, > > > thread_count=0, > > > name=0x100464c0 "opensm") at cl_dispatcher.c:214 > > > #11 0x00000000100133f8 in ?? () > > > #12 0x00000000100059b4 in ?? () > > > #13 0x00000080b970411c in .generic_start_main () > > > from /lib64/tls/libc.so.6 > > > #14 0x00000080b97042a4 in .__libc_start_main () > > > from /lib64/tls/libc.so.6 > > > #15 0x0000000000000000 in ?? () > > > > __cl_malloc_priv is just a wrapper for malloc: > > > > from cl_memory_osd.c: > > void* > > __cl_malloc_priv( > > IN const size_t size ) > > { > > return malloc( size ); > > } > > > > If I believe gdb this appears to be a malloc of 0 bytes but since the > > new_capacity was 6349 (and this would be multiplied by sizeof(void *)), > > I'm not sure whether to trust this. > > > > Can you send me the compile line from the OpenSM build ? Are the include > > paths correct for 64 bit headers ? 
> > > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and > > > everything runs, but I get the following message... > > > > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0 > > > > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting > > > Generic Notice type:3 num:66 from LID:0x0000 > > > GID:0xfe80000000000000,0x0000000000000000 > > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting > > > Generic Notice type:3 num:66 from LID:0x0000 > > > GID:0xfe80000000000000,0x0000000000000000 > > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr: > > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port > > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port > > > 0x2c90109764831. > > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port > > > 0x2c90109764831. > > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to > > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping > > > > For some reason, on the response received, it is not finding the match > > in the transaction table. I thought this was fixed a while ago for > > PowerPC. Can you run opensm with -V and see if there is any more output > > that might be helpful ? 
> >
> > > Other info:
> > > [root@m2 ~]# ibstat
> > > CA 'mthca0'
> > > CA type: MT23108
> > > Number of ports: 2
> > > Firmware version: 3.3.2
> > > Hardware version: a1
> > > Node GUID: 0x0002c90109764830
> > > System image GUID: 0x0002c90109764833
> > > Port 1:
> > > State: Initializing
> > > Physical state: LinkUp
> > > Rate: 10
> > > Base lid: 0
> > > LMC: 0
> > > SM lid: 0
> > > Capability mask: 0x00510a68
> > > Port GUID: 0x0002c90109764831
> > > Port 2:
> > > State: Down
> > > Physical state: Polling
> > > Rate: 2
> > > Base lid: 0
> > > LMC: 0
> > > SM lid: 0
> > > Capability mask: 0x00510a68
> > > Port GUID: 0x0002c90109764832
> > >
> > >
> > > [root@m2 ~]# ibstatus
> > > Infiniband device 'mthca0' port 1 status:
> > > default gid: fe80:0000:0000:0000:0002:c901:0976:4831
> > > base lid: 0x0
> > > sm lid: 0x0
> > > state: 2: INIT
> > > phys state: 5: LinkUp
> > > rate: 10 Gb/sec (4X)
> >
> > This is goodness and means the physical link has been established on
> > this port.
> >
> > > Infiniband device 'mthca0' port 2 status:
> > > default gid: fe80:0000:0000:0000:0002:c901:0976:4832
> > > base lid: 0x0
> > > sm lid: 0x0
> > > state: 1: DOWN
> > > phys state: 2: Polling
> > > rate: 2.5 Gb/sec (1X)
> > >
> > >
> > > My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> > > SBS as far as I can tell and my contact no longer works there so I'm
> > > going to have to find the new person to talk about getting newer
> > > firmware, unless of course another vendor's firmware will work on this
> > > card.
> >
> > I think 3.3.2 should be OK. In any case, I doubt it's the source of the
> > problem above.
> >
> > -- Hal
> >
> > > Cheers,
> > > Owen
> > >
> > >
>

From ostampflee at terrasoftsolutions.com Thu Feb 16 21:01:59 2006
From: ostampflee at terrasoftsolutions.com (Owen Stampflee)
Date: Thu, 16 Feb 2006 21:01:59 -0800
Subject: [openib-general] OpenSM realloc error
In-Reply-To: <1140151518.4333.37979.camel@hal.voltaire.com>
References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com> <1140140596.9680.13.camel@beast.terraplex.com> <1140151518.4333.37979.camel@hal.voltaire.com>
Message-ID: <1140152519.9680.24.camel@beast.terraplex.com>

Of course, I need to get things working first, then we can deal with the 64-bit issues (gotta please the boss, and if shipping 32-bit binaries and both 32/64-bit libraries provides a working udapl, ipoib, and 32+64-bit mpi, I can meet my deadline (Monday)). I'm suspecting some glibc issues on our end, but I've never seen these before and since we're using a RHEL-based toolchain, this _should_ just work.

Any thoughts on ipoib? My research hasn't shown that problem before. I'll see if I can get mvapich built tomorrow and see if that at least works.

Thanks for all the assistance,
Owen

On Thu, 2006-02-16 at 23:45 -0500, Hal Rosenstock wrote:
> On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote:
> > A 32-bit build of 5411 gets the link to become active
>
> Glad to hear this. That is what I would expect and would like to confirm
> the tid patch is missing from the FC5 package as well as getting to the
> bottom of the 64 bit issues if you have some time to help on this.
>
> -- Hal
>
> > and ibv_rc_pingpong works, but I can't bring up IPoIB...
> > > > dmesg says this (tried both ib0 and ib1 to ensure ports werent swapped) > > ADDRCONF(NETDEV_UP): ib0: link is not ready > > ADDRCONF(NETDEV_UP): ib1: link is not ready > > > > At least we're making progress. > > > > Thanks, > > Owen > > > > On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote: > > > Hi Owen, > > > > > > On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote: > > > > So, here is the back trace with no code modifications... > > > > > > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > (gdb) bt > > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > > #3 0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6 > > > > #4 0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6 > > > > #5 0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6 > > > > #6 0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6 > > > > #7 0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6 > > > > #8 0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6 > > > > #9 0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6 > > > > #10 0x00000080a362be90 in .cl_log_event () > > > > from /usr/lib64/libosmcomp.so.1 > > > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1 > > > > #12 0x000000001001316c in ?? () > > > > #13 0x00000000100059b4 in ?? () > > > > #14 0x00000080b970411c in .generic_start_main () > > > > from /lib64/tls/libc.so.6 > > > > #15 0x00000080b97042a4 in .__libc_start_main () > > > > from /lib64/tls/libc.so.6 > > > > #16 0x0000000000000000 in ?? 
() > > > > (gdb) > > > > > > > > Commenting out the cl_log_event in osm_log results in this backtrace: > > > > > > > > (gdb) bt > > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > > #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6 > > > > #4 0x00000080b9758b50 in .__GI___libc_malloc () > > > > from /lib64/tls/libc.so.6 > > > > #5 0x00000400000607bc in __cl_malloc_priv (size=0) at > > > > cl_memory_osd.c:62 > > > > #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416 > > > > #7 0x00000400000629f4 in cl_ptr_vector_set_capacity > > > > (p_vector=0x100788d0, > > > > new_capacity=6349) at cl_ptr_vector.c:216 > > > > #8 0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16) > > > > at cl_ptr_vector.c:270 > > > > #9 0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, > > > > min_size=6349, > > > > grow_size=16) at cl_ptr_vector.c:93 > > > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, > > > > thread_count=0, > > > > name=0x100464c0 "opensm") at cl_dispatcher.c:214 > > > > #11 0x00000000100133f8 in ?? () > > > > #12 0x00000000100059b4 in ?? () > > > > #13 0x00000080b970411c in .generic_start_main () > > > > from /lib64/tls/libc.so.6 > > > > #14 0x00000080b97042a4 in .__libc_start_main () > > > > from /lib64/tls/libc.so.6 > > > > #15 0x0000000000000000 in ?? () > > > > > > __cl_malloc_priv is just a wrapper for malloc: > > > > > > from cl_memory_osd.c: > > > void* > > > __cl_malloc_priv( > > > IN const size_t size ) > > > { > > > return malloc( size ); > > > } > > > > > > If I believe gdb this appears to be a malloc of 0 bytes but since the > > > new_capacity was 6349 (and this would be multiplied by sizeof(void *)), > > > I'm not sure whether to trust this. 
> > > > > > Can you send me the compile line from the OpenSM build ? Are the include > > > paths correct for 64 bit headers ? > > > > > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and > > > > everything runs, but I get the following message... > > > > > > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0 > > > > > > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting > > > > Generic Notice type:3 num:66 from LID:0x0000 > > > > GID:0xfe80000000000000,0x0000000000000000 > > > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting > > > > Generic Notice type:3 num:66 from LID:0x0000 > > > > GID:0xfe80000000000000,0x0000000000000000 > > > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr: > > > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port > > > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port > > > > 0x2c90109764831. > > > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port > > > > 0x2c90109764831. > > > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to > > > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping > > > > > > For some reason, on the response received, it is not finding the match > > > in the transaction table. I thought this was fixed a while ago for > > > PowerPC. Can you run opensm with -V and see if there is any more output > > > that might be helpful ? 
> > >
> > > > Other info:
> > > > [root@m2 ~]# ibstat
> > > > CA 'mthca0'
> > > > CA type: MT23108
> > > > Number of ports: 2
> > > > Firmware version: 3.3.2
> > > > Hardware version: a1
> > > > Node GUID: 0x0002c90109764830
> > > > System image GUID: 0x0002c90109764833
> > > > Port 1:
> > > > State: Initializing
> > > > Physical state: LinkUp
> > > > Rate: 10
> > > > Base lid: 0
> > > > LMC: 0
> > > > SM lid: 0
> > > > Capability mask: 0x00510a68
> > > > Port GUID: 0x0002c90109764831
> > > > Port 2:
> > > > State: Down
> > > > Physical state: Polling
> > > > Rate: 2
> > > > Base lid: 0
> > > > LMC: 0
> > > > SM lid: 0
> > > > Capability mask: 0x00510a68
> > > > Port GUID: 0x0002c90109764832
> > > >
> > > >
> > > > [root@m2 ~]# ibstatus
> > > > Infiniband device 'mthca0' port 1 status:
> > > > default gid: fe80:0000:0000:0000:0002:c901:0976:4831
> > > > base lid: 0x0
> > > > sm lid: 0x0
> > > > state: 2: INIT
> > > > phys state: 5: LinkUp
> > > > rate: 10 Gb/sec (4X)
> > >
> > > This is goodness and means the physical link has been established on
> > > this port.
> > >
> > > > Infiniband device 'mthca0' port 2 status:
> > > > default gid: fe80:0000:0000:0000:0002:c901:0976:4832
> > > > base lid: 0x0
> > > > sm lid: 0x0
> > > > state: 1: DOWN
> > > > phys state: 2: Polling
> > > > rate: 2.5 Gb/sec (1X)
> > > >
> > > >
> > > > My archives suggest a firmware upgrade, but 3.3.3 isn't available from
> > > > SBS as far as I can tell and my contact no longer works there so I'm
> > > > going to have to find the new person to talk about getting newer
> > > > firmware, unless of course another vendor's firmware will work on this
> > > > card.
> > >
> > > I think 3.3.2 should be OK. In any case, I doubt it's the source of the
> > > problem above.
> > >
> > > -- Hal
> > >
> > > > Cheers,
> > > > Owen
> > > >
> > > >
From karun at gs-lab.com Thu Feb 16 23:01:09 2006
From: karun at gs-lab.com (Karun Beer Sharma)
Date: Fri, 17 Feb 2006 12:31:09 +0530
Subject: [openib-general] IBG2 installation
Message-ID: <43F574B5.903@gs-lab.com>

Hi:

I have installed IBG2 (ver. 2.0.1 from Mellanox) with kernel version 2.6.9-22EL. The installation seems OK and I am able to execute some of the commands, like ibnetdiscover. Then I downloaded Netpipe (ver 3.6.2) and tried to build it (make ib) on my machine. I am getting errors about missing header files (ib_defs.h). I checked the makefile and observed that a VAPI_INC path is required. I searched /usr/include but was not able to find the required header files.

Please let me know if I need to install something else as well.

Thanks.

Regards,
Karun

From ianjiang.ict at gmail.com Thu Feb 16 23:19:50 2006
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Fri, 17 Feb 2006 15:19:50 +0800
Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty
Message-ID: <7b2fa1820602162319r663e2c10o8790269cd286b038@mail.gmail.com>

To get familiar with the IBGD-1.8.0 VAPI, I wrote a very simple program based on two examples in IBGD, *hca_per* and *rctp*. A Sender and a Receiver run on two different nodes just to complete a Send/Recv process.

Sender
======
(a) Create IB resources:
    1. List HCAs (only one HCA in fact)
    2. Get the handle of the HCA
    3. Query the HCA
    4. Allocate a PD
    5. Query Port 1 of the HCA (only one Port in fact)
    6. Create Send CQ and Recv CQ
    7. Create QP
(b) Modify QP to INIT state:
    1. qp_move_to_init(&params);
(c) Create MRs for Recv and Send respectively:
    1. user_mr_create(&params.in_mr, params.mr_sz_req);
    2. user_mr_create(&params.out_mr, params.mr_sz_req);
(d) Send parameters to Receiver
(e) Get ready to transfer:
    1. Modify QP to RTR state
    2. Modify QP to RTS state
(f) Post Send
    1. post_send_req(&params.ib_res, &params.out_mr)
(g) Wait for Send to complete
    1.
reap_send_req(&params.ib_res, &params.out_mr, 1/* not block*/);

Receiver
========
(a) Wait for parameters from Sender
(b) Create IB resources:
    1. List HCAs (only one HCA in fact)
    2. Get the handle of the HCA
    3. Query the HCA
    4. Allocate a PD
    5. Query Port 1 of the HCA (only one Port in fact)
    6. Create Send CQ and Recv CQ
    7. Create QP
(c) Modify QP to INIT state:
    1. qp_move_to_init(&params);
(d) Create MRs for Recv and Send respectively:
    1. user_mr_create(&params.in_mr, params.mr_sz_req);
    2. user_mr_create(&params.out_mr, params.mr_sz_req);
(e) Post Recv
    1. post_recv_req(&params.ib_res, &params.in_mr)
(f) Get ready to transfer:
    1. Modify QP to RTR state
    2. Modify QP to RTS state
(g) Wait for Recv to complete
    1. reap_recv_req(&params.ib_res, &params.in_mr, 1/* not block*/);

Problem:
========
VAPI_poll_cq returned "CQ is empty" for both the Send CQ and the Recv CQ. I failed to find out where the problem was, so I am turning to OpenIB for help. I am afraid that I do not understand CQ processing well enough. Any suggestion is appreciated!

Here are some pieces of the code:
=================================
/*********************************** Create IB Resources ****************************************/
int ib_res_create(struct ib_resource *ib_res_p)
{
    VAPI_ret_t vapi_ret;
    u_int32_t num_of_hcas;
    VAPI_hca_id_t inst_hca_id;
    VAPI_cqe_num_t num_of_cqe;
    VAPI_srq_attr_t srq_props;
    VAPI_srq_attr_t actual_srq_props;
    VAPI_qp_init_attr_t qp_init_attr;
    VAPI_qp_init_attr_ext_t qp_ext_attr;
    VAPI_qp_prop_t qp_prop;

    if (ib_res_p == NULL) {
        PRINT_ERR("NULL ib_res_p\n");
        return -1;
    }
    ini_ib_res(ib_res_p);

    /* list HCAs */
    vapi_ret = EVAPI_list_hcas(1, &num_of_hcas, &inst_hca_id);
    if ((vapi_ret != VAPI_OK) && (vapi_ret != VAPI_EAGAIN)) {
        printf("list HCAs failed\n");
        VAPIERR(vapi_ret);
        return -1;
    }
    PRINT_TRACE("number of HCAs: %d, HCA ID: %s\n", num_of_hcas, (char *)inst_hca_id);

    switch(num_of_hcas) {
    case 0:
        printf("No HCAs installed\n");
        return -1;
    case 1:
        strcpy(ib_res_p->hca_id, inst_hca_id);
        break;
    default:
        /* ToDo: deal with multiple HCAs */
printf("ToDo: deal with multiple HCAs\n"); printf("Use the first HCA\n"); strcpy(ib_res_p->hca_id, inst_hca_id); } PRINT_TRACE("HCA to be used: %s\n", (char *)ib_res_p->hca_id); /* get the handle of the HCA */ vapi_ret = EVAPI_get_hca_hndl(ib_res_p->hca_id, &ib_res_p->hca_hndl); if (vapi_ret != VAPI_OK) { printf("HCA not open\n"); VAPIERR(vapi_ret); goto clean_exit; } /* query the HCA */ vapi_ret = VAPI_query_hca_cap(ib_res_p->hca_hndl, &ib_res_p->hca_vendor, &ib_res_p->hca_cap); if (vapi_ret != VAPI_OK) { printf("Query HCA failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_HCA_CAP(&ib_res_p->hca_vendor, &ib_res_p->hca_cap); /* allocate PD */ //vapi_ret = EVAPI_alloc_pd(ib_res_p->hca_hndl, MAX_NUM_AVS, &ib_res_p->pd_hndl); vapi_ret = VAPI_alloc_pd(ib_res_p->hca_hndl, &ib_res_p->pd_hndl); if (vapi_ret != VAPI_OK) { printf("Allocate PA failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("PD allocated: %ld\n", ib_res_p->pd_hndl); /* query Port */ vapi_ret = VAPI_query_hca_port_prop(ib_res_p->hca_hndl, DEFAULT_PORT_NUM, &ib_res_p->hca_port); if (vapi_ret != VAPI_OK) { printf("Query Port %d failed\n", DEFAULT_PORT_NUM); VAPIERR(vapi_ret); goto clean_exit; } PRINT_PORT_PROP(&ib_res_p->hca_port); /* send CQ */ vapi_ret = VAPI_create_cq(ib_res_p->hca_hndl, MIN_SEND_CQE_NUM, &ib_res_p->s_cq_hndl, &num_of_cqe); if (vapi_ret != VAPI_OK) { printf("Create CQ for send failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("CQ for send created. CQE NUM: %d\n", num_of_cqe); /* receive CQ */ vapi_ret = VAPI_create_cq(ib_res_p->hca_hndl, MIN_SEND_CQE_NUM, &ib_res_p->r_cq_hndl, &num_of_cqe); if (vapi_ret != VAPI_OK) { printf("Create CQ for send failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("CQ for receive created. 
CQE NUM: %d\n", num_of_cqe); /* QP */ qp_init_attr.rq_cq_hndl = ib_res_p->r_cq_hndl; qp_init_attr.sq_cq_hndl = ib_res_p->s_cq_hndl; qp_init_attr.cap.max_oust_wr_rq = QP_INI_MAX_OUST_WR_RQ_NUM; qp_init_attr.cap.max_oust_wr_sq = QP_INI_MAX_OUST_WR_SQ_NUM; qp_init_attr.cap.max_sg_size_rq = QP_INI_MAX_SG_SIZE_RQ_NUM; qp_init_attr.cap.max_sg_size_sq = QP_INI_MAX_SG_SIZE_SQ_NUM; qp_init_attr.pd_hndl = ib_res_p->pd_hndl; qp_init_attr.rdd_hndl = 0; qp_init_attr.sq_sig_type = VAPI_SIGNAL_REQ_WR; qp_init_attr.rq_sig_type = VAPI_SIGNAL_ALL_WR; qp_init_attr.ts_type = VAPI_TS_RC; vapi_ret = VAPI_create_qp_ext(ib_res_p->hca_hndl, &qp_init_attr, &qp_ext_attr, &ib_res_p->qp_entry.qp_hndl, &qp_prop); if (vapi_ret != VAPI_OK) { printf("Create QP failed\n"); VAPIERR(vapi_ret); goto clean_exit; } ib_res_p->qp_entry.qp_num = qp_prop.qp_num; ib_res_p->qp_entry.srq_hndl = ib_res_p->srq_hndl; PRINT_TRACE("QP created\n"); PRINT_QP_PROP(&qp_prop); return 0; clean_exit: clean_ib_res(ib_res_p); return -1; } /*****************************Modify QP state ***************************************/ int qp_move_to_init(test_params_t *param_p) { VAPI_qp_attr_mask_t qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t res; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_INIT; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_QP_STATE); qp_attr.pkey_ix = 0; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_PKEY_IX); qp_attr.port = DEFAULT_PORT_NUM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_PORT); qp_attr.remote_atomic_flags = VAPI_EN_REM_WRITE | VAPI_EN_REM_READ | VAPI_EN_REM_ATOMIC_OP; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_REMOTE_ATOMIC_FLAGS); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != VAPI_OK) { printf("Error: Modifying QP to INIT: %s\n", VAPI_strerror(res)); return -1; } PRINT_TRACE("Modified QP to INIT\n"); print_qp_cap(&qp_cap); return 0; } int qp_move_to_rtr(test_params_t *param_p) { VAPI_qp_attr_mask_t 
qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t res; param_p->mtu = (param_p->ib_res.hca_vendor.vendor_part_id == 23108) ? MTU1024 : MTU2048; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_RTR; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_QP_STATE); qp_attr.av.sl = 0; /*USED_SL*/ qp_attr.av.grh_flag = FALSE; qp_attr.av.dlid = param_p->dst_msg.lid; qp_attr.av.static_rate = 2; /* 1x */ qp_attr.av.src_path_bits = 0; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_AV); qp_attr.path_mtu = param_p->mtu; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_PATH_MTU); qp_attr.rq_psn = START_PSN; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RQ_PSN); qp_attr.qp_ous_rd_atom = QP_OUS_RD_ATOM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_QP_OUS_RD_ATOM); qp_attr.dest_qp_num = param_p->dst_msg.qp_num; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_DEST_QP_NUM); qp_attr.min_rnr_timer = 0; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_MIN_RNR_TIMER); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != VAPI_OK) { printf("Error: Modifying QP to RTR: %s\n", VAPI_strerror(res)); return -1/*(RET_ERR)*/; } PRINT_TRACE("Modified QP to RTR\n"); print_qp_cap(&qp_cap); return 0; } int qp_move_to_rts(test_params_t *param_p) { VAPI_qp_attr_mask_t qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t res; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_RTS; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_QP_STATE); qp_attr.sq_psn = START_PSN; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_SQ_PSN); qp_attr.timeout = 18; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_TIMEOUT); qp_attr.retry_count = 6; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RETRY_COUNT); qp_attr.rnr_retry = 6; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RNR_RETRY); qp_attr.ous_dst_rd_atom = QP_OUS_RD_ATOM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_OUS_DST_RD_ATOM); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != 
VAPI_OK) { printf("Error: Modifying QP to RTS: %s\n", VAPI_strerror(res)); return /*(RET_ERR)*/-1; } PRINT_TRACE("Modified QP to RTS\n"); print_qp_cap(&qp_cap); return 0; } /************************** Recv/Send requests ******************************/ /* * post receive request */ int post_recv_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p) { VAPI_ret_t res; VAPI_rr_desc_t rr; VAPI_sg_lst_entry_t sg_entry_r; VAPI_hca_hndl_t hca_hndl; VAPI_qp_hndl_t qp_hndl; VAPI_srq_hndl_t srq_hndl; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; qp_hndl = ib_res_p->qp_entry.qp_hndl; hca_hndl = ib_res_p->srq_hndl; rr.opcode = VAPI_RECEIVE; rr.comp_type = VAPI_SIGNALED; rr.sg_lst_len = 1; sg_entry_r.lkey = u_mr_p->mrw_rep.l_key; sg_entry_r.len = u_mr_p->mrw_req.size; sg_entry_r.addr = (VAPI_virt_addr_t)(MT_virt_addr_t)u_mr_p->user_buf; rr.sg_lst_p = &sg_entry_r; rr.id = sg_entry_r.addr; PRINT_RECV_REQ(&rr); res = VAPI_post_rr(hca_hndl, qp_hndl, &rr); if (res != VAPI_OK) { printf("VAPI post Recv Req failed\n"); VAPIERR(res); return -1; } return 0; } /* * post send request */ int post_send_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p) { VAPI_ret_t res; VAPI_sr_desc_t sr; VAPI_sg_lst_entry_t sg_entry_s; VAPI_hca_hndl_t hca_hndl; VAPI_qp_hndl_t qp_hndl; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; qp_hndl = ib_res_p->qp_entry.qp_hndl; sr.comp_type = VAPI_SIGNALED; sr.set_se = FALSE; sr.opcode = VAPI_SEND; sr.remote_qkey = 0; sr.sg_lst_len = 1; sg_entry_s.lkey = u_mr_p->mrw_rep.l_key; sg_entry_s.len = u_mr_p->mrw_req.size; sg_entry_s.addr = (VAPI_virt_addr_t)(MT_virt_addr_t)u_mr_p->user_buf; sr.sg_lst_p = &sg_entry_s; sr.id = sg_entry_s.addr; PRINT_SEND_REQ(&sr); res = VAPI_post_sr(hca_hndl, qp_hndl, &sr); if (res != 
VAPI_OK) { printf("VAPI post Send Req failed\n"); VAPIERR(res); return -1; } return 0; } int reap_send_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p, int block) { VAPI_ret_t res; VAPI_wc_desc_t wc_desc; VAPI_hca_hndl_t hca_hndl; VAPI_cq_hndl_t s_cq_hndl; int poll_cnt = 0; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; s_cq_hndl = ib_res_p->s_cq_hndl; if (block) { do { poll_cnt++; MTPERF_TIME_START(VAPI_poll_cq); res = VAPI_poll_cq(hca_hndl, s_cq_hndl, &wc_desc); //res = EVAPI_poll_cq_block(hca_hndl, s_cq_hndl, REAP_REQ_WAIT_TIME, &wc_desc); MTPERF_TIME_END(VAPI_poll_cq); if (res != VAPI_OK && res != VAPI_CQ_EMPTY) { PRINT_ERR("Poll CQ block failed\n"); VAPIERR(res); return -1; } show_qp_state(hca_hndl, ib_res_p->qp_entry.qp_hndl, ib_res_p->qp_entry.qp_num); VAPI_RET(res); } while(res == VAPI_CQ_EMPTY && poll_cnt < 10); if (wc_desc.status != VAPI_SUCCESS) { PRINT_ERR("Req unsuccess: %s\n", VAPI_wc_status_sym(wc_desc.status)); PRINT_WC_DESC(&wc_desc); return -1; } } else { printf("ToDo: %s for unblock\n", __func__); } PRINT_TRACE("Req success\n"); PRINT_WC_DESC(&wc_desc); return 0; } int reap_recv_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p, int block) { VAPI_ret_t res; VAPI_wc_desc_t wc_desc; VAPI_hca_hndl_t hca_hndl; VAPI_cq_hndl_t r_cq_hndl; int poll_cnt = 0; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; r_cq_hndl = ib_res_p->r_cq_hndl; if (block) { do { poll_cnt++; res = VAPI_poll_cq(hca_hndl, r_cq_hndl,&wc_desc); if (res != VAPI_OK && res != VAPI_CQ_EMPTY) { PRINT_ERR("Poll CQ block failed\n"); VAPIERR(res); return -1; } sleep(1); } while(res == VAPI_CQ_EMPTY && poll_cnt < 20); if (wc_desc.status != VAPI_SUCCESS) { PRINT_ERR("Req failed: %s\n", VAPI_wc_status_sym(wc_desc.status)); 
            PRINT_WC_DESC(&wc_desc);
            return -1;
        }
    } else {
        printf("ToDo: %s for unblock\n", __func__);
    }

    PRINT_TRACE("Req success\n");
    PRINT_WC_DESC(&wc_desc);
    return 0;
}

--
Ian Jiang
ianjiang.ict at gmail.com

Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ianjiang.ict at gmail.com Thu Feb 16 23:30:15 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Fri, 17 Feb 2006 15:30:15 +0800 Subject: [openib-general] IBG2 installation In-Reply-To: <7b2fa1820602162328t5250bdcalb204d24690f646ae@mail.gmail.com> References: <7b2fa1820602162328t5250bdcalb204d24690f646ae@mail.gmail.com> Message-ID: <7b2fa1820602162330s42606df9oe2c58d603a425cf0@mail.gmail.com>

Hi Karun,

I have not installed IBG2, but I could find ib_defs.h in /usr/local/ibgd/driver/infinihost/include. /usr/local/ibgd is where I installed IBGD-1.8.0.

--
Ian Jiang
ianjiang.ict at gmail.com

Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mst at mellanox.co.il Fri Feb 17 02:02:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 17 Feb 2006 12:02:22 +0200 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: <20060216194821.GB20997@mellanox.co.il> Message-ID: <20060217100222.GB19033@mellanox.co.il>

Quoting r. Roland Dreier :
> libmthca already uses posix_memalign to make sure that CQ and QP
> buffers are page-aligned.

Does this guarantee that nothing else the child might need falls into the same page?

-- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies
From RAISCH at de.ibm.com Fri Feb 17 03:59:01 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Fri, 17 Feb 2006 12:59:01 +0100 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <000101c63347$c25ff140$b5a0070a@amr.corp.intel.com> Message-ID: we also currently prefer svn for the 1.0 release. At some point we'll have to backport bugfixes to the 1.0 release found in a later development version. Having to do that at all isn't really fun, but having to keep some of that code in "sort of" sync between different repositories is even more difficult. Gruss / Regards . . . Christoph R. "Bob Woodruff" To Sent by: "'Roland Dreier'" openib-general-bo unces at openib.org cc openib-general at openib.org Subject 16.02.2006 23:24 RE: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond Roland wrote, >Can you be more explicit about the pain? What does it make worse? >- R. Today when I want to get the tip, I simply do a SVN update of my tree and everything that has changed gets updated. I can also subscribe to the commits email list to know if something changes. If some components are now in a git tree, I would need to first install and learn git, then pull some components from git from kernel.org, some components from SVN and hope they work together. And if some code gets moved to another site, like kernel.org, is that development really still covered by the openib licensing and promoters agreements ?
woody _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Feb 17 04:01:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Feb 2006 07:01:40 -0500 Subject: [openib-general] OpenSM realloc error In-Reply-To: <1140152519.9680.24.camel@beast.terraplex.com> References: <1139945214.2169.4.camel@beast.terraplex.com> <1139963140.4333.16065.camel@hal.voltaire.com> <1139964683.5941.10.camel@beast.terraplex.com> <1140004573.4333.19957.camel@hal.voltaire.com> <1140021789.21679.2.camel@beast.terraplex.com> <1140023843.4333.21715.camel@hal.voltaire.com> <1140025284.22080.2.camel@beast.terraplex.com> <20060215184711.GE12172@sashak.voltaire.com> <1140028908.4333.22226.camel@hal.voltaire.com> <1140125269.8783.13.camel@beast.terraplex.com> <1140131863.4333.34690.camel@hal.voltaire.com> <1140140596.9680.13.camel@beast.terraplex.com> <1140151518.4333.37979.camel@hal.voltaire.com> <1140152519.9680.24.camel@beast.terraplex.com> Message-ID: <1140177696.4333.42940.camel@hal.voltaire.com> Hi Owen, On Fri, 2006-02-17 at 00:01, Owen Stampflee wrote: > Of course, I need to get things working first, than we can deal with the > 64-bit issues (gotta please the boss, and if shipping 32-bit binarys and > both 32/64 bit libraries provides a working udapl, ipoib, and 32+64-bit > mpi, I can meet my deadline (Monday)). I'm suspecting some glibc issues > on our end, but I've never seen these before and since we're using a > RHEL-based toolchain, this _should_ just work. > > Any thoughts on ipoib? Are you referring to the ib: link is not ready messages ? How are the IPoIB interfaces being configured ? Are they by the network scripts ? Does it use arping (to look for duplicates) ? Is DHCP enabled or is a static address assigned ? 
Can you try statically configuring an IPoIB subnet first ? [You might also want to start another thread on this issue as some who could help may not read all the way down to this after they see the subject line.] -- Hal > My research hasnt shown that problem before. I'll > see if I can get mvapich built tomorrow and see if that at least works. > > Thanks for all the assistance, > Owen > > On Thu, 2006-02-16 at 23:45 -0500, Hal Rosenstock wrote: > > On Thu, 2006-02-16 at 20:43, Owen Stampflee wrote: > > > A 32-bit build of 5411 gets the link to become active > > > > Glad to hear this.That is what I would expect and would like to confirm > > the tid patch is missing from the FC5 package as well as getting to the > > bottom of the 64 bit issues if you have some time to help on this. > > > > -- Hal > > > > > and ipv_rc_pingpng works, but I cant bring up ipoib... > > > > > > dmesg says this (tried both ib0 and ib1 to ensure ports werent swapped) > > > ADDRCONF(NETDEV_UP): ib0: link is not ready > > > ADDRCONF(NETDEV_UP): ib1: link is not ready > > > > > > At least we're making progress. > > > > > > Thanks, > > > Owen > > > > > > On Thu, 2006-02-16 at 18:18 -0500, Hal Rosenstock wrote: > > > > Hi Owen, > > > > > > > > On Thu, 2006-02-16 at 16:27, Owen Stampflee wrote: > > > > > So, here is the back trace with no code modifications... 
> > > > > > > > > > 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > > (gdb) bt > > > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > > > #3 0x00000080b97580bc in ._int_realloc () from /lib64/tls/libc.so.6 > > > > > #4 0x00000080b9759528 in .__realloc () from /lib64/tls/libc.so.6 > > > > > #5 0x00000080b975942c in .__realloc () from /lib64/tls/libc.so.6 > > > > > #6 0x00000080b974cd30 in ._IO_mem_finish () from /lib64/tls/libc.so.6 > > > > > #7 0x00000080b97426b8 in ._IO_new_fclose () from /lib64/tls/libc.so.6 > > > > > #8 0x00000080b97b795c in .__GI_vsyslog () from /lib64/tls/libc.so.6 > > > > > #9 0x00000080b97b7ddc in .__GI_syslog () from /lib64/tls/libc.so.6 > > > > > #10 0x00000080a362be90 in .cl_log_event () > > > > > from /usr/lib64/libosmcomp.so.1 > > > > > #11 0x00000080a35f5700 in .osm_log () from /usr/lib64/libopensm.so.1 > > > > > #12 0x000000001001316c in ?? () > > > > > #13 0x00000000100059b4 in ?? () > > > > > #14 0x00000080b970411c in .generic_start_main () > > > > > from /lib64/tls/libc.so.6 > > > > > #15 0x00000080b97042a4 in .__libc_start_main () > > > > > from /lib64/tls/libc.so.6 > > > > > #16 0x0000000000000000 in ?? 
() > > > > > (gdb) > > > > > > > > > > Commenting out the cl_log_event in osm_log results in this backtrace: > > > > > > > > > > (gdb) bt > > > > > #0 0x00000080b9719db0 in .__GI_raise () from /lib64/tls/libc.so.6 > > > > > #1 0x00000080b971b89c in .__GI_abort () from /lib64/tls/libc.so.6 > > > > > #2 0x00000080b974e860 in .__libc_message () from /lib64/tls/libc.so.6 > > > > > #3 0x00000080b9756db0 in ._int_malloc () from /lib64/tls/libc.so.6 > > > > > #4 0x00000080b9758b50 in .__GI___libc_malloc () > > > > > from /lib64/tls/libc.so.6 > > > > > #5 0x00000400000607bc in __cl_malloc_priv (size=0) at > > > > > cl_memory_osd.c:62 > > > > > #6 0x00000400000604d4 in __cl_zalloc_ntrk (size=0) at cl_memory.c:416 > > > > > #7 0x00000400000629f4 in cl_ptr_vector_set_capacity > > > > > (p_vector=0x100788d0, > > > > > new_capacity=6349) at cl_ptr_vector.c:216 > > > > > #8 0x0000040000062acc in cl_ptr_vector_set_size (p_vector=0x0, size=16) > > > > > at cl_ptr_vector.c:270 > > > > > #9 0x0000040000062c08 in cl_ptr_vector_init (p_vector=0x100788d0, > > > > > min_size=6349, > > > > > grow_size=16) at cl_ptr_vector.c:93 > > > > > #10 0x000004000005bb00 in cl_disp_init (p_disp=0x100788a0, > > > > > thread_count=0, > > > > > name=0x100464c0 "opensm") at cl_dispatcher.c:214 > > > > > #11 0x00000000100133f8 in ?? () > > > > > #12 0x00000000100059b4 in ?? () > > > > > #13 0x00000080b970411c in .generic_start_main () > > > > > from /lib64/tls/libc.so.6 > > > > > #14 0x00000080b97042a4 in .__libc_start_main () > > > > > from /lib64/tls/libc.so.6 > > > > > #15 0x0000000000000000 in ?? 
() > > > > > > > > __cl_malloc_priv is just a wrapper for malloc: > > > > > > > > from cl_memory_osd.c: > > > > void* > > > > __cl_malloc_priv( > > > > IN const size_t size ) > > > > { > > > > return malloc( size ); > > > > } > > > > > > > > If I believe gdb this appears to be a malloc of 0 bytes but since the > > > > new_capacity was 6349 (and this would be multiplied by sizeof(void *)), > > > > I'm not sure whether to trust this. > > > > > > > > Can you send me the compile line from the OpenSM build ? Are the include > > > > paths correct for 64 bit headers ? > > > > > > > > > So now I've compiled it in 32-bit mode (had to fix my chroot) and > > > > > everything runs, but I get the following message... > > > > > > > > > > Feb 16 13:59:28 006732 [0000] -> OpenSM Rev:openib-1.1.0 > > > > > > > > > > Feb 16 13:59:28 008210 [F7E8D020] -> osm_report_notice: Reporting > > > > > Generic Notice type:3 num:66 from LID:0x0000 > > > > > GID:0xfe80000000000000,0x0000000000000000 > > > > > Feb 16 13:59:28 008292 [F7E8D020] -> osm_report_notice: Reporting > > > > > Generic Notice type:3 num:66 from LID:0x0000 > > > > > GID:0xfe80000000000000,0x0000000000000000 > > > > > Feb 16 13:59:28 015894 [F7E8D020] -> osm_vendor_get_all_port_attr: > > > > > assign CA mthca0 port 1 guid (0x2c90109764831) as the default port > > > > > Feb 16 13:59:28 015977 [F7E8D020] -> osm_vendor_bind: Binding to port > > > > > 0x2c90109764831. > > > > > Feb 16 13:59:28 021293 [F7E8D020] -> osm_vendor_bind: Binding to port > > > > > 0x2c90109764831. > > > > > Feb 16 13:59:28 021692 [F568C4E0] -> umad_receiver: ERR 5413: Failed to > > > > > obtain request madw for received MAD(method=0x81 attr=0x11) -- dropping > > > > > > > > For some reason, on the response received, it is not finding the match > > > > in the transaction table. I thought this was fixed a while ago for > > > > PowerPC. Can you run opensm with -V and see if there is any more output > > > > that might be helpful ? 
> > > > > > > > > Other info: > > > > > [root at m2 ~]# ibstat > > > > > CA 'mthca0' > > > > > CA type: MT23108 > > > > > Number of ports: 2 > > > > > Firmware version: 3.3.2 > > > > > Hardware version: a1 > > > > > Node GUID: 0x0002c90109764830 > > > > > System image GUID: 0x0002c90109764833 > > > > > Port 1: > > > > > State: Initializing > > > > > Physical state: LinkUp > > > > > Rate: 10 > > > > > Base lid: 0 > > > > > LMC: 0 > > > > > SM lid: 0 > > > > > Capability mask: 0x00510a68 > > > > > Port GUID: 0x0002c90109764831 > > > > > Port 2: > > > > > State: Down > > > > > Physical state: Polling > > > > > Rate: 2 > > > > > Base lid: 0 > > > > > LMC: 0 > > > > > SM lid: 0 > > > > > Capability mask: 0x00510a68 > > > > > Port GUID: 0x0002c90109764832 > > > > > > > > > > > > > > > [root at m2 ~]# ibstatus > > > > > Infiniband device 'mthca0' port 1 status: > > > > > default gid: fe80:0000:0000:0000:0002:c901:0976:4831 > > > > > base lid: 0x0 > > > > > sm lid: 0x0 > > > > > state: 2: INIT > > > > > phys state: 5: LinkUp > > > > > rate: 10 Gb/sec (4X) > > > > > > > > This is goodness and means the physical link has been established on > > > > this port. > > > > > > > > > Infiniband device 'mthca0' port 2 status: > > > > > default gid: fe80:0000:0000:0000:0002:c901:0976:4832 > > > > > base lid: 0x0 > > > > > sm lid: 0x0 > > > > > state: 1: DOWN > > > > > phys state: 2: Polling > > > > > rate: 2.5 Gb/sec (1X) > > > > > > > > > > > > > > > My archives suggest a firmware upgrade, but 3.3.3 isnt available from > > > > > SBS as far as I can tell and my contact no longer works there so I'm > > > > > going to have to find the new person to talk about getting newer > > > > > firmware, unless of course another vendors firmware will work on this > > > > > card. > > > > > > > > I think 3.3.2 should be OK. In any case, I doubt it's the source of the > > > > problem above. 
> > > > > > > > -- Hal > > > > > > > > > Cheers, > > > > > Owen > > > > > > > > > > > > > > > > > > > > > > > > > > !DSPAM:43f5572d122323871347016! >
From mst at mellanox.co.il Fri Feb 17 04:43:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 17 Feb 2006 14:43:12 +0200 Subject: [openib-general] Re: IBG2 installation In-Reply-To: <43F574B5.903@gs-lab.com> References: <43F574B5.903@gs-lab.com> Message-ID: <20060217124312.GC19033@mellanox.co.il> Quoting r.
Karun Beer Sharma : > Subject: IBG2 installation > > Hi: > > I have installed IBG2 (ver. 2.0.1 from Mellanox) with 2.6.9-22EL kernel > version. > The installation seems OK and I am able to execute some of the commands > like ibnetdiscover etc. > > Then I downloaded Netpipe (ver 3.6.2) and tried to make (make ib) it on > my machine. I am getting errors of missing header files (ib_defs.h). > I checked the makefile and observed that VAPI_INC path is required. I > searhed /usr/include but was not able to find the required header files. > > Please let me know if I need to install something else also. > > Thanks. > Regards, > Karun Looks like its trying to use gen1 headers (VAPI is gen1). I dont know much about netpipe - does it support gen2? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> Message-ID: <1140189358.4333.44536.camel@hal.voltaire.com> On Thu, 2006-02-16 at 14:53, Dror Goldenberg wrote: > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > > Sent: Thursday, February 16, 2006 1:13 PM > > > > On
Thu, 2006-02-16 at 02:54, Michael S. Tsirkin wrote: > > > Quoting r. Hal Rosenstock : > > > > Subject: Re: Re: [PATCH] change Mellanox SDP workaround to a > > > > moduleparameter > > > > > > > > On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > > > > > > > > > I guess the question is what to do when a Tavor (with the > > > > > performance bug that makes a 1K MTU faster) connects to someone > > > > > else. > > > > > > > > Isn't it the other way 'round (when something with a larger MTU > > > > connects to Tavor) ? > > > > > > Right. I wish we had an MTU field in the REP packet, but we dont. > > > > Yes, that would be better IMO too. Not sure why it wasn't > > done that way. Guess you could file an erratum on this. > > > > -- Hal > > The SWG defined a generic mechanism which uses REJ to indicate that > the passive side does not accept a certain REQ fields, and allows the > passive > side to indicate an alternative value. Indirection is also supported > through the > same protocol. It also allows the active side, following the REJ, to use > an > alternate value, other than the one suggested by the passive side, i.e. > passive > side only has a veto capability. This is the mechanism and the short > theory > behind it. Unfortunately it's a bit inefficient in terms of performance > because of > the ping pong of messages. Solving just the MTU might not be a good > enough > argument. The approach should be to enable the active side to specify a > set > of acceptable parameters for each one of the REQ fields, and then let > the passive > side to choose. This may change the CM packets all over and will > introduce new > problems. I don't think that there's a good chance of just adding a > solution for > just one of the fields. Anyway, you can still try and propose this to > IBTA, I tried it > once already :) Thanks for the historical perspective. It's harder to overturn an existing vote on something at the IBTA. Not sure I have the time to take up this (larger) mission. 
-- Hal From rdreier at cisco.com Fri Feb 17 08:14:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 08:14:31 -0800 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060217100222.GB19033@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 17 Feb 2006 12:02:22 +0200") References: <20060216194821.GB20997@mellanox.co.il> <20060217100222.GB19033@mellanox.co.il> Message-ID: Michael> Does this guarantee that nothing else the child might Michael> need falls into the same page? Yes, for example CQ buffers are allocated with: if (posix_memalign(&buf, dev->page_size, align(nent * MTHCA_CQ_ENTRY_SIZE, dev->page_size))) dev->page_size is initialized with: dev->page_size = sysconf(_SC_PAGESIZE); so the allocation gives a buffer aligned to the page size, with a size that is a multiple of the page size. - R. From rdreier at cisco.com Fri Feb 17 08:15:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 08:15:49 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: (Christoph Raisch's message of "Fri, 17 Feb 2006 12:59:01 +0100") References: Message-ID: Christoph> we also currently prefer svn for the 1.0 release. At Christoph> some point we'll have to backport bugfixes to the 1.0 Christoph> release found in a later development version. Having Christoph> to do that at all isn't really fun, but having to keep Christoph> some of that code in "sort of" sync between different Christoph> repositories is even more difficult. There seems to be some confusion here. There would only ever be one libibverbs repository. The only question is whether it remains in svn or moves to a different SCM. In fact moving to git would make porting patches between different branches far easier. - R. 
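[Archive note] The page-aligned allocation Roland describes in his reply to Michael above (posix_memalign with the page size from sysconf, and a length rounded up to a whole number of pages) can be sketched roughly as follows. This is a minimal illustration, not libmthca's code: the helper names align_up and alloc_page_buf are hypothetical, and the 32-byte entry size in the usage note is only an example value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Round val up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t val, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (val + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: allocate a zeroed buffer that starts on a page
 * boundary and whose length is a whole number of pages, so that no other
 * heap object can share a page with it.
 * Returns NULL on failure; on success stores the real length in *alloc_size.
 */
static void *alloc_page_buf(size_t nbytes, size_t *alloc_size)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = align_up(nbytes, page_size);
    void *buf = NULL;

    if (posix_memalign(&buf, page_size, size))
        return NULL;

    memset(buf, 0, size);
    *alloc_size = size;
    return buf;
}
```

For example, alloc_page_buf(100 * 32, &len) would request room for 100 entries of 32 bytes each and return a buffer whose start address and length are both multiples of the page size; the caller releases it with free().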
From bos at pathscale.com Fri Feb 17 08:43:02 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 17 Feb 2006 08:43:02 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: Message-ID: <1140194582.30456.22.camel@serpentine.pathscale.com> On Fri, 2006-02-17 at 08:15 -0800, Roland Dreier wrote: > There seems to be some confusion here. There would only ever be one > libibverbs repository. The only question is whether it remains in svn > or moves to a different SCM. Since the SVN tree has some inertia among developers already, someone will need to drop occasional patches into it until it gets replaced, or forever, whichever comes first. > In fact moving to git would make porting patches between different > branches far easier. If you move to git, I'll provide a Mercurial mirror of the git repo. OpenSM/libvendor: In osm_vendor_ibumad.c, fix some size_t issues related to memory allocation Signed-off-by: Hal Rosenstock Index: libibumad/include/infiniband/umad.h =================================================================== --- libibumad/include/infiniband/umad.h (revision 5436) +++ libibumad/include/infiniband/umad.h (working copy) @@ -160,7 +160,7 @@ int umad_open_port(char *ca_name, int po int umad_close_port(int portid); void * umad_get_mad(void *umad); -int umad_size(void); +size_t umad_size(void); int umad_status(void *umad); ib_mad_addr_t *umad_get_mad_addr(void *umad); @@ -189,7 +189,7 @@ void umad_dump(void *umad); #include static inline void * -umad_alloc(int num, int size) /* alloc array of umad buffers */ +umad_alloc(int num, size_t size) /* alloc array of umad buffers */ { return calloc(num, size); } Index: libibumad/src/umad.c =================================================================== --- libibumad/src/umad.c (revision 5436) +++ libibumad/src/umad.c (working copy) @@ -672,7 +672,7 @@ umad_get_mad(void *umad) return ((struct ib_user_mad *)umad)->data; } -int +size_t umad_size(void) { return 
sizeof (struct ib_user_mad); From halr at voltaire.com Fri Feb 17 09:27:37 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Feb 2006 12:27:37 -0500 Subject: [openib-general] [PATCH] OpenSM/st.c: Fix some size_t issues related to memory allocation in st.c Message-ID: <1140197256.4476.60.camel@hal.voltaire.com> OpenSM/st.c: Fix some size_t issues related to memory allocation in st.c Signed-off-by: Hal Rosenstock Index: include/opensm/st.h =================================================================== --- include/opensm/st.h (revision 5436) +++ include/opensm/st.h (working copy) @@ -40,6 +40,8 @@ #ifndef ST_INCLUDED #define ST_INCLUDED +#include + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -79,11 +81,11 @@ struct st_table { enum st_retval {ST_CONTINUE, ST_STOP, ST_DELETE}; st_table *st_init_table(struct st_hash_type *); -st_table *st_init_table_with_size(struct st_hash_type *, int); +st_table *st_init_table_with_size(struct st_hash_type *, size_t); st_table *st_init_numtable(void); -st_table *st_init_numtable_with_size(int); +st_table *st_init_numtable_with_size(size_t); st_table *st_init_strtable(void); -st_table *st_init_strtable_with_size(int); +st_table *st_init_strtable_with_size(size_t); int st_delete(st_table *, st_data_t *, st_data_t *); int st_delete_safe(st_table *, st_data_t *, st_data_t *, st_data_t); int st_insert(st_table *, st_data_t, st_data_t); Index: opensm/st.c =================================================================== --- opensm/st.c (revision 5436) +++ opensm/st.c (working copy) @@ -42,7 +42,6 @@ #endif /* HAVE_CONFIG_H */ #include -#include #include #include @@ -102,17 +101,11 @@ static struct st_hash_type type_strhash #define xcalloc calloc #define xrealloc realloc #define xfree free -#if 0 -void *xmalloc(long); -void *xcalloc(long, long); -void *xrealloc(void *, long); -void xfree(void *); -#endif static void rehash(st_table *); -#define alloc(type) 
(type*)xmalloc((unsigned)sizeof(type)) -#define Calloc(n,s) (char*)xcalloc((n),(s)) +#define alloc(type) (type*)xmalloc(sizeof(type)) +#define Calloc(n,s) (char*)xcalloc((n), (s)) #define EQUAL(table,x,y) ((x)==(y) || (*table->type->compare)(((void*)x),((void *)y)) == 0) @@ -200,7 +193,7 @@ stat_col() st_table* st_init_table_with_size(type, size) struct st_hash_type *type; - int size; + size_t size; { st_table *tbl; @@ -238,7 +231,7 @@ st_init_numtable(void) st_table* st_init_numtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_numhash, size); } @@ -251,7 +244,7 @@ st_init_strtable(void) st_table* st_init_strtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_strhash, size); } @@ -314,7 +307,8 @@ st_lookup(table, key, value) return 0; } else { - if (value != 0) *value = ptr->record; + if (value != 0) + *value = ptr->record; return 1; } } @@ -407,7 +401,8 @@ st_copy(old_table) { st_table *new_table; st_table_entry *ptr, *entry; - int i, num_bins = old_table->num_bins; + int i; + size_t num_bins = old_table->num_bins; new_table = alloc(st_table); if (new_table == 0) @@ -417,7 +412,7 @@ st_copy(old_table) *new_table = *old_table; new_table->bins = (st_table_entry**) - Calloc((unsigned)num_bins, sizeof(st_table_entry*)); + Calloc(num_bins, sizeof(st_table_entry*)); if (new_table->bins == 0) { @@ -524,7 +519,7 @@ st_delete_safe(table, key, value, never) } static int -delete_never( st_data_t key, st_data_t value, st_data_t never) +delete_never(st_data_t key, st_data_t value, st_data_t never) { if (value == never) return ST_DELETE; return ST_CONTINUE; From rolandd at cisco.com Fri Feb 17 16:57:04 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:04 -0800 Subject: [openib-general] [PATCH 01/22] Add powerpc-specific clear_cacheline(), which just compiles to "dcbz". 
In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005704.13620.88286.stgit@localhost.localdomain> From: Roland Dreier This is horribly non-portable. How much of a performance difference does it make? How does it do on ppc64 systems where the cacheline size is not 32? --- drivers/infiniband/hw/ehca/ehca_asm.h | 58 +++++++++++++++++++++++++++++++++ 1 files changed, 58 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_asm.h b/drivers/infiniband/hw/ehca/ehca_asm.h new file mode 100644 index 0000000..6a09ac5 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_asm.h @@ -0,0 +1,58 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Some helper macros with assembler instructions + * + * Authors: Khadija Souissi + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_asm.h,v 1.7 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#ifndef __EHCA_ASM_H__ +#define __EHCA_ASM_H__ + +#if defined(CONFIG_PPC_PSERIES) || defined (__PPC64__) || defined (__PPC__) + +#define clear_cacheline(adr) __asm__ __volatile("dcbz 0,%0"::"r"(adr)) + +#elif defined(CONFIG_ARCH_S390) +#error "unsupported yet" +#else +#error "invalid platform" +#endif + +#endif /* __EHCA_ASM_H__ */ From rolandd at cisco.com Fri Feb 17 16:55:32 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:55:32 -0800 Subject: [openib-general] [PATCH 00/22] [RFC] IBM eHCA InfiniBand adapter driver Message-ID: <20060218005532.13620.79663.stgit@localhost.localdomain> Here's a series of patches that add an InfiniBand adapter driver for IBM eHCA hardware. Please look it over with an eye towards issues that need to be addressed before merging this upstream. This patch series is somewhat unusual in that I am not the original author of this driver -- I am just sending it for review for the authors, who are apparently not able to post patches themselves due to internal issues at IBM. However they are cc'ed and will respond to comments in this thread. In fact I have some issues with the code myself that need to be addressed before this driver is mergeable. I've included most of them in the individual patches, although I have some general comments too. 
However I would like to get some early feedback for the ehca authors from the wider community. In particular I think it's important to run this past the ppc64 experts, since I'm not sure what the standards for this sort of pSeries driver are. Anyway, my general comments: - The #ifs that test EHCA_USERDRIVER and __KERNEL__ should be killed. We know that this is kernel code, so there's no reason to include userspace compatibility junk. - Many of the comments look like they are for some automatic documentation system that is not quite kerneldoc. They should be fixed to be real kerneldoc comments. - In general there is a huge amount of code in large inline functions in .h files. Things should be reorganized to cut this down to a sane amount. Thanks, Roland From rolandd at cisco.com Fri Feb 17 16:57:07 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:07 -0800 Subject: [openib-general] [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005707.13620.20538.stgit@localhost.localdomain> From: Roland Dreier This is a very large file with way too much code for a .h file. The functions look too big to be inlined also. Is there any way for this code to move to a .c file? --- drivers/infiniband/hw/ehca/hcp_if.h | 2022 +++++++++++++++++++++++++++++++++++ 1 files changed, 2022 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h new file mode 100644 index 0000000..70bf77f --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_if.h @@ -0,0 +1,2022 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware Infiniband Interface code for POWER + * + * Authors: Gerd Bayer + * Christoph Raisch + * Waleri Fomin + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. 
+ * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hcp_if.h,v 1.62 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HCP_IF_H__ +#define __HCP_IF_H__ + +#include "ehca_tools.h" +#include "hipz_structs.h" +#include "ehca_classes.h" + +#ifndef EHCA_USE_HCALL +#include "hcz_queue.h" +#include "hcz_mrmw.h" +#include "hcz_emmio.h" +#include "sim_prom.h" +#endif +#include "hipz_fns.h" +#include "hcp_sense.h" +#include "ehca_irq.h" + +#ifndef CONFIG_PPC64 +#ifndef Z_SERIES +#warning "included with wrong target, this is a p file" +#endif +#endif + +#ifdef EHCA_USE_HCALL + +#ifndef EHCA_USERDRIVER +#include "hcp_phyp.h" +#else +#include "testbench/hcallbridge.h" +#endif +#endif + +inline static int hcp_galpas_ctor(struct h_galpas *galpas, + u64 paddr_kernel, u64 paddr_user) +{ + int rc = 0; + + rc = hcall_map_page(paddr_kernel, &galpas->kernel.fw_handle); + if (rc != 0) + return (rc); + + galpas->user.fw_handle = paddr_user; + + EDEB(7, "paddr_kernel=%lx paddr_user=%lx galpas->kernel=%lx" + " galpas->user=%lx", + paddr_kernel, paddr_user, galpas->kernel.fw_handle, + galpas->user.fw_handle); + + return (rc); +} + +inline static int hcp_galpas_dtor(struct h_galpas *galpas) +{ + int rc = 0; + + if (galpas->kernel.fw_handle != 0) + rc = hcall_unmap_page(galpas->kernel.fw_handle); + + if (rc != 0) + return (rc); + + galpas->user.fw_handle = galpas->kernel.fw_handle = 0; + + return rc; +} + +/** + * hipz_h_alloc_resource_eq - Allocate EQ resources in HW and FW, initialize + * resources, create the empty EQPT (ring). 
+ * + * @eq_handle: eq handle for this queue + * @act_nr_of_entries: actual number of queue entries + * @act_pages: actual number of queue pages + * @eq_ist: used by hcp_H_XIRR() call + */ +inline static u64 hipz_h_alloc_resource_eq(const struct + ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfeq *pfeq, + const u32 neq_control, + const u32 + number_of_entries, + struct ipz_eq_handle + *eq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + u32 * eq_ist) +{ + u64 retcode; + u64 dummy; + u64 act_nr_of_entries_out = 0; + u64 act_pages_out = 0; + u64 eq_ist_out = 0; + u64 allocate_controls = 0; + u32 x = (u64)(&x); + + EDEB_EN(7, "pfeq=%p hcp_adapter_handle=%lx new_control=%x" + " number_of_entries=%x", + pfeq, hcp_adapter_handle.handle, neq_control, + number_of_entries); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_alloc_resource_eq(hcp_adapter_handle, pfeq, + neq_control, + number_of_entries, + eq_handle, + act_nr_of_entries, + act_pages, eq_ist); +#else + + /* resource type */ + allocate_controls = 3ULL; + + /* ISN is associated */ + if (neq_control != 1) { + allocate_controls = (1ULL << (63 - 7)) | allocate_controls; + } + + /* notification event queue */ + if (neq_control == 1) { + allocate_controls = (1ULL << 63) | allocate_controls; + } + + retcode = plpar_hcall_7arg_7ret(H_ALLOC_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + allocate_controls, /* r5 */ + number_of_entries, /* r6 */ + 0, 0, 0, 0, + &eq_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out, /* r7 */ + &act_pages_out, /* r8 */ + &eq_ist_out, /* r9 */ + &dummy); + + *act_nr_of_entries = (u32) act_nr_of_entries_out; + *act_pages = (u32) act_pages_out; + *eq_ist = (u32) eq_ist_out; + +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resource - retcode=%lx ", retcode); + } + + EDEB_EX(7, "act_nr_of_entries=%x act_pages=%x eq_ist=%x", + *act_nr_of_entries, *act_pages, *eq_ist); + + return retcode; +} + 
+static inline u64 hipz_h_reset_event(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ipz_eq_handle eq_handle, + const u64 event_mask) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "eq_handle=%lx, adapter_handle=%lx event_mask=%lx", + eq_handle.handle, hcp_adapter_handle.handle, event_mask); + +#ifndef EHCA_USE_HCALL + /* TODO: Not implemented yet */ +#else + + retcode = plpar_hcall_7arg_7ret(H_RESET_EVENTS, + hcp_adapter_handle.handle, /* r4 */ + eq_handle.handle, /* r5 */ + event_mask, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif + EDEB(7, "retcode=%lx", retcode); + + return retcode; +} + +/** + * hipz_h_alloc_resource_cq - Allocate CQ resources in HW and FW, initialize + * resources, create the empty CQPT (ring). + * + * @eq_handle: eq handle to use for this cq + * @cq_handle: cq handle for this queue + * @act_nr_of_entries: actual number of queue entries + * @act_pages: actual number of queue pages + * @galpas: contain logical address of priv. 
storage and + * log_user_storage + */ +static inline u64 hipz_h_alloc_resource_cq(const struct + ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfcq *pfcq, + const struct ipz_eq_handle + eq_handle, + const u32 cq_token, + const u32 + number_of_entries, + struct ipz_cq_handle + *cq_handle, + u32 * act_nr_of_entries, + u32 * act_pages, + struct h_galpas *galpas) +{ + u64 retcode = 0; + u64 dummy; + u64 act_nr_of_entries_out; + u64 act_pages_out; + u64 g_la_privileged_out; + u64 g_la_user_out; + /* stack location is a unique identifier for a process from beginning + * to end of this frame */ + u32 x = (u64)(&x); + + EDEB_EN(7, "pfcq=%p hcp_adapter_handle=%lx eq_handle=%lx cq_token=%x" + " number_of_entries=%x", + pfcq, hcp_adapter_handle.handle, eq_handle.handle, + cq_token, number_of_entries); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_alloc_resource_cq(hcp_adapter_handle, + pfcq, + eq_handle, + cq_token, + number_of_entries, + cq_handle, + act_nr_of_entries, + act_pages, galpas); +#else + retcode = plpar_hcall_7arg_7ret(H_ALLOC_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + 2, /* r5 */ + eq_handle.handle, /* r6 */ + cq_token, /* r7 */ + number_of_entries, /* r8 */ + 0, 0, + &cq_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &act_nr_of_entries_out, /* r7 */ + &act_pages_out, /* r8 */ + &g_la_privileged_out, /* r9 */ + &g_la_user_out); /* r10 */ + + *act_nr_of_entries = (u32) act_nr_of_entries_out; + *act_pages = (u32) act_pages_out; + + if (retcode == 0) { + hcp_galpas_ctor(galpas, g_la_privileged_out, g_la_user_out); + } +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. 
retcode=%lx", retcode); + } + + EDEB_EX(7, "cq_handle=%lx act_nr_of_entries=%x act_pages=%x", + cq_handle->handle, *act_nr_of_entries, *act_pages); + + return retcode; +} + +#define H_ALL_RES_QP_Enhanced_QP_Operations EHCA_BMASK_IBM(9,11) +#define H_ALL_RES_QP_QP_PTE_Pin EHCA_BMASK_IBM(12,12) +#define H_ALL_RES_QP_Service_Type EHCA_BMASK_IBM(13,15) +#define H_ALL_RES_QP_LL_RQ_CQE_Posting EHCA_BMASK_IBM(18,18) +#define H_ALL_RES_QP_LL_SQ_CQE_Posting EHCA_BMASK_IBM(19,21) +#define H_ALL_RES_QP_Signalling_Type EHCA_BMASK_IBM(22,23) +#define H_ALL_RES_QP_UD_Address_Vector_L_Key_Control EHCA_BMASK_IBM(31,31) +#define H_ALL_RES_QP_Resource_Type EHCA_BMASK_IBM(56,63) + +#define H_ALL_RES_QP_Max_Outstanding_Send_Work_Requests EHCA_BMASK_IBM(0,15) +#define H_ALL_RES_QP_Max_Outstanding_Receive_Work_Requests EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_Max_Send_SG_Elements EHCA_BMASK_IBM(32,39) +#define H_ALL_RES_QP_Max_Receive_SG_Elements EHCA_BMASK_IBM(40,47) + +#define H_ALL_RES_QP_Act_Outstanding_Send_Work_Requests EHCA_BMASK_IBM(16,31) +#define H_ALL_RES_QP_Act_Outstanding_Receive_Work_Requests EHCA_BMASK_IBM(48,63) +#define H_ALL_RES_QP_Act_Send_SG_Elements EHCA_BMASK_IBM(8,15) +#define H_ALL_RES_QP_Act_Receeive_SG_Elements EHCA_BMASK_IBM(24,31) + +#define H_ALL_RES_QP_Send_Queue_Size_pages EHCA_BMASK_IBM(0,31) +#define H_ALL_RES_QP_Receive_Queue_Size_pages EHCA_BMASK_IBM(32,63) + +/* direct access qp controls */ +#define DAQP_CTRL_ENABLE 0x01 +#define DAQP_CTRL_SEND_COMPLETION 0x20 +#define DAQP_CTRL_RECV_COMPLETION 0x40 + +/** + * hipz_h_alloc_resource_qp - Allocate QP resources in HW and FW, + * initialize resources, create empty QPPTs (2 rings). 
+ * + * @h_galpas to access HCA resident QP attributes + */ +static inline u64 hipz_h_alloc_resource_qp(const struct + ipz_adapter_handle + adapter_handle, + struct ehca_pfqp *pfqp, + const u8 servicetype, + const u8 daqp_ctrl, + const u8 signalingtype, + const u8 ud_av_l_key_ctl, + const struct ipz_cq_handle send_cq_handle, + const struct ipz_cq_handle receive_cq_handle, + const struct ipz_eq_handle async_eq_handle, + const u32 qp_token, + const struct ipz_pd pd, + const u16 max_nr_send_wqes, + const u16 max_nr_receive_wqes, + const u8 max_nr_send_sges, + const u8 max_nr_receive_sges, + const u32 ud_av_l_key, + struct ipz_qp_handle *qp_handle, + u32 * qp_nr, + u16 * act_nr_send_wqes, + u16 * act_nr_receive_wqes, + u8 * act_nr_send_sges, + u8 * act_nr_receive_sges, + u32 * nr_sq_pages, + u32 * nr_rq_pages, + struct h_galpas *h_galpas) +{ + u64 retcode = H_Success; + u64 allocate_controls; + u64 max_r10_reg; + u64 dummy = 0; + u64 qp_nr_out = 0; + u64 r6_out = 0; + u64 r7_out = 0; + u64 r8_out = 0; + u64 g_la_user_out = 0; + u64 r11_out = 0; + + EDEB_EN(7, "pfqp=%p adapter_handle=%lx servicetype=%x signalingtype=%x" + " ud_av_l_key=%x send_cq_handle=%lx receive_cq_handle=%lx" + " async_eq_handle=%lx qp_token=%x pd=%x max_nr_send_wqes=%x" + " max_nr_receive_wqes=%x max_nr_send_sges=%x" + " max_nr_receive_sges=%x ud_av_l_key=%x galpa.pid=%x", + pfqp, adapter_handle.handle, servicetype, signalingtype, + ud_av_l_key, send_cq_handle.handle, + receive_cq_handle.handle, async_eq_handle.handle, qp_token, + pd.value, max_nr_send_wqes, max_nr_receive_wqes, + max_nr_send_sges, max_nr_receive_sges, ud_av_l_key, + h_galpas->pid); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_alloc_resource_qp(adapter_handle, + pfqp, + servicetype, + signalingtype, + ud_av_l_key_ctl, + send_cq_handle, + receive_cq_handle, + async_eq_handle, + qp_token, + pd, + max_nr_send_wqes, + max_nr_receive_wqes, + max_nr_send_sges, + max_nr_receive_sges, + ud_av_l_key, + qp_handle, + qp_nr, + 
act_nr_send_wqes, + act_nr_receive_wqes, + act_nr_send_sges, + act_nr_receive_sges, + nr_sq_pages, nr_rq_pages, h_galpas); + +#else + allocate_controls = + EHCA_BMASK_SET(H_ALL_RES_QP_Enhanced_QP_Operations, + (daqp_ctrl & DAQP_CTRL_ENABLE) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_QP_PTE_Pin, 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_Service_Type, servicetype) + | EHCA_BMASK_SET(H_ALL_RES_QP_Signalling_Type, signalingtype) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_Posting, + (daqp_ctrl & DAQP_CTRL_RECV_COMPLETION) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_Posting, + (daqp_ctrl & DAQP_CTRL_SEND_COMPLETION) ? 1 : 0) + | EHCA_BMASK_SET(H_ALL_RES_QP_UD_Address_Vector_L_Key_Control, + ud_av_l_key_ctl) + | EHCA_BMASK_SET(H_ALL_RES_QP_Resource_Type, 1); + + max_r10_reg = + EHCA_BMASK_SET(H_ALL_RES_QP_Max_Outstanding_Send_Work_Requests, + max_nr_send_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_Max_Outstanding_Receive_Work_Requests, + max_nr_receive_wqes) + | EHCA_BMASK_SET(H_ALL_RES_QP_Max_Send_SG_Elements, + max_nr_send_sges) + | EHCA_BMASK_SET(H_ALL_RES_QP_Max_Receive_SG_Elements, + max_nr_receive_sges); + + + retcode = plpar_hcall_9arg_9ret(H_ALLOC_RESOURCE, + adapter_handle.handle, /* r4 */ + allocate_controls, /* r5 */ + send_cq_handle.handle, /* r6 */ + receive_cq_handle.handle,/* r7 */ + async_eq_handle.handle, /* r8 */ + ((u64) qp_token << 32) + | pd.value, /* r9 */ + max_r10_reg, /* r10 */ + ud_av_l_key, /* r11 */ + 0, + &qp_handle->handle, /* r4 */ + &qp_nr_out, /* r5 */ + &r6_out, /* r6 */ + &r7_out, /* r7 */ + &r8_out, /* r8 */ + &dummy, /* r9 */ + &g_la_user_out, /* r10 */ + &r11_out, + &dummy); + + /* extract outputs */ + *qp_nr = (u32) qp_nr_out; + *act_nr_send_wqes = (u16) + EHCA_BMASK_GET(H_ALL_RES_QP_Act_Outstanding_Send_Work_Requests, + r6_out); + *act_nr_receive_wqes = (u16) + EHCA_BMASK_GET(H_ALL_RES_QP_Act_Outstanding_Receive_Work_Requests, + r6_out); + *act_nr_send_sges = + (u8) EHCA_BMASK_GET(H_ALL_RES_QP_Act_Send_SG_Elements, + r7_out); + 
*act_nr_receive_sges = + (u8) EHCA_BMASK_GET(H_ALL_RES_QP_Act_Receeive_SG_Elements, + r7_out); + *nr_sq_pages = + (u32) EHCA_BMASK_GET(H_ALL_RES_QP_Send_Queue_Size_pages, + r8_out); + *nr_rq_pages = + (u32) EHCA_BMASK_GET(H_ALL_RES_QP_Receive_Queue_Size_pages, + r8_out); + if (retcode == 0) { + hcp_galpas_ctor(h_galpas, g_la_user_out, g_la_user_out); + } +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. retcode=%lx", + retcode); + } + + EDEB_EX(7, "qp_nr=%x act_nr_send_wqes=%x" + " act_nr_receive_wqes=%x act_nr_send_sges=%x" + " act_nr_receive_sges=%x nr_sq_pages=%x" + " nr_rq_pages=%x galpa.user=%lx galpa.kernel=%lx", + *qp_nr, *act_nr_send_wqes, *act_nr_receive_wqes, + *act_nr_send_sges, *act_nr_receive_sges, *nr_sq_pages, + *nr_rq_pages, h_galpas->user.fw_handle, + h_galpas->kernel.fw_handle); + + return (retcode); +} + +static inline u64 hipz_h_query_port(const struct ipz_adapter_handle + hcp_adapter_handle, + const u8 port_id, + struct query_port_rblock + *query_port_response_block) +{ + u64 retcode = H_Success; + u64 dummy; + u64 r_cb; + + EDEB_EN(7, "hcp_adapter_handle=%lx port_id %x", + hcp_adapter_handle.handle, port_id); + + if ((((u64)query_port_response_block) & 0xfff) != 0) { + EDEB_ERR(4, "response block not page aligned"); + retcode = H_Parameter; + return (retcode); + } + +#ifndef EHCA_USE_HCALL + retcode = 0; +#else + r_cb = ehca_kv_to_g(query_port_response_block); + + retcode = plpar_hcall_7arg_7ret(H_QUERY_PORT, + hcp_adapter_handle.handle, /* r4 */ + port_id, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + + EDEB(7, "offset0=%x offset1=%x offset2=%x offset3=%x", + ((u32 *) query_port_response_block)[0], + ((u32 *) query_port_response_block)[1], + ((u32 *) query_port_response_block)[2], + ((u32 *) query_port_response_block)[3]); + EDEB(7, "offset4=%x offset5=%x offset6=%x offset7=%x", + ((u32 
*) query_port_response_block)[4], + ((u32 *) query_port_response_block)[5], + ((u32 *) query_port_response_block)[6], + ((u32 *) query_port_response_block)[7]); + EDEB(7, "offset8=%x offset9=%x offseta=%x offsetb=%x", + ((u32 *) query_port_response_block)[8], + ((u32 *) query_port_response_block)[9], + ((u32 *) query_port_response_block)[10], + ((u32 *) query_port_response_block)[11]); + EDEB(7, "offsetc=%x offsetd=%x offsete=%x offsetf=%x", + ((u32 *) query_port_response_block)[12], + ((u32 *) query_port_response_block)[13], + ((u32 *) query_port_response_block)[14], + ((u32 *) query_port_response_block)[15]); + EDEB(7, "offset31=%x offset35=%x offset36=%x", + ((u32 *) query_port_response_block)[32], + ((u32 *) query_port_response_block)[36], + ((u32 *) query_port_response_block)[37]); + EDEB(7, "offset200=%x offset201=%x offset202=%x " + "offset203=%x", + ((u32 *) query_port_response_block)[0x200], + ((u32 *) query_port_response_block)[0x201], + ((u32 *) query_port_response_block)[0x202], + ((u32 *) query_port_response_block)[0x203]); + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_query_hca(const struct ipz_adapter_handle + hcp_adapter_handle, + struct query_hca_rblock + *query_hca_rblock) +{ + u64 retcode = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "hcp_adapter_handle=%lx", hcp_adapter_handle.handle); + + if ((((u64)query_hca_rblock) & 0xfff) != 0) { + EDEB_ERR(4, "response block not page aligned"); + retcode = H_Parameter; + return (retcode); + } + +#ifndef EHCA_USE_HCALL + retcode = 0; +#else + r_cb = ehca_kv_to_g(query_hca_rblock); + + retcode = plpar_hcall_7arg_7ret(H_QUERY_HCA, + hcp_adapter_handle.handle, /* r4 */ + r_cb, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + + EDEB(7, "offset0=%x offset1=%x offset2=%x offset3=%x", + ((u32 *) query_hca_rblock)[0], + ((u32 *) query_hca_rblock)[1], + ((u32 *) query_hca_rblock)[2], ((u32 *) 
query_hca_rblock)[3]); + EDEB(7, "offset4=%x offset5=%x offset6=%x offset7=%x", + ((u32 *) query_hca_rblock)[4], + ((u32 *) query_hca_rblock)[5], + ((u32 *) query_hca_rblock)[6], ((u32 *) query_hca_rblock)[7]); + EDEB(7, "offset8=%x offset9=%x offseta=%x offsetb=%x", + ((u32 *) query_hca_rblock)[8], + ((u32 *) query_hca_rblock)[9], + ((u32 *) query_hca_rblock)[10], ((u32 *) query_hca_rblock)[11]); + EDEB(7, "offsetc=%x offsetd=%x offsete=%x offsetf=%x", + ((u32 *) query_hca_rblock)[12], + ((u32 *) query_hca_rblock)[13], + ((u32 *) query_hca_rblock)[14], ((u32 *) query_hca_rblock)[15]); + EDEB(7, "offset136=%x offset192=%x offset204=%x", + ((u32 *) query_hca_rblock)[32], + ((u32 *) query_hca_rblock)[48], ((u32 *) query_hca_rblock)[51]); + EDEB(7, "offset231=%x offset235=%x", + ((u32 *) query_hca_rblock)[57], ((u32 *) query_hca_rblock)[58]); + EDEB(7, "offset200=%x offset201=%x offset202=%x offset203=%x", + ((u32 *) query_hca_rblock)[0x201], + ((u32 *) query_hca_rblock)[0x202], + ((u32 *) query_hca_rblock)[0x203], + ((u32 *) query_hca_rblock)[0x204]); + + EDEB_EX(7, "retcode=%lx hcp_adapter_handle=%lx", + retcode, hcp_adapter_handle.handle); + + return retcode; +} + +/** + * hipz_h_register_rpage - hcp_if.h internal function for all + * hcp_H_REGISTER_RPAGE calls. 
+ * + * @logical_address_of_page: kv transformation to GX address in this routine + */ +static inline u64 hipz_h_register_rpage(const struct + ipz_adapter_handle + hcp_adapter_handle, + const u8 pagesize, + const u8 queue_type, + const u64 resource_handle, + const u64 + logical_address_of_page, + u64 count) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "hcp_adapter_handle=%lx pagesize=%x queue_type=%x" + " resource_handle=%lx logical_address_of_page=%lx count=%lx", + hcp_adapter_handle.handle, pagesize, queue_type, + resource_handle, logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + EDEB_ERR(4, "Not implemented"); +#else + retcode = plpar_hcall_7arg_7ret(H_REGISTER_RPAGES, + hcp_adapter_handle.handle, /* r4 */ + queue_type | pagesize << 8, /* r5 */ + resource_handle, /* r6 */ + logical_address_of_page, /* r7 */ + count, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_register_rpage_eq(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_eq_handle + eq_handle, + struct ehca_pfeq *pfeq, + const u8 pagesize, + const u8 queue_type, + const u64 + logical_address_of_page, + const u64 count) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfeq=%p hcp_adapter_handle=%lx eq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfeq, hcp_adapter_handle.handle, eq_handle.handle, pagesize, + queue_type,logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_register_rpage_eq(hcp_adapter_handle, eq_handle, pfeq, + pagesize, queue_type, + logical_address_of_page, count); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_Parameter); + } + retcode = hipz_h_register_rpage(hcp_adapter_handle, + pagesize, + queue_type, + eq_handle.handle, + logical_address_of_page, count); +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, 
"retcode=%lx", retcode); + + return retcode; +} + +static inline u32 hipz_request_interrupt(struct ehca_irq_info *irq_info, + irqreturn_t(*handler) + (int, void *, struct pt_regs *)) +{ + + int ret = 0; + + EDEB_EN(7, "ist=0x%x", irq_info->ist); + +#ifdef EHCA_USE_HCALL +#ifndef EHCA_USERDRIVER + ret = ibmebus_request_irq(NULL, irq_info->ist, handler, + SA_INTERRUPT, "ehca", (void *)irq_info); + + if (ret < 0) + EDEB_ERR(4, "Can't map interrupt handler."); +#else + struct hcall_irq_info hirq = {.irq = irq_info->irq, + .ist = irq_info->ist, + .pid = irq_info->pid}; + + hirq = hirq; + ret = hcall_reg_eqh(&hirq, ehca_interrupt_eq); +#endif /* EHCA_USERDRIVER */ +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static inline void hipz_free_interrupt(struct ehca_irq_info *irq_info) +{ +#ifdef EHCA_USE_HCALL +#ifndef EHCA_USERDRIVER + ibmebus_free_irq(NULL, irq_info->ist, (void *)irq_info); +#endif +#endif +} + +static inline u32 hipz_h_query_int_state(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_irq_info *irq_info) +{ + u32 rc = 0; + u64 dummy = 0; + + EDEB_EN(7, "ist=0x%x", irq_info->ist); + +#ifdef EHCA_USE_HCALL +#ifdef EHCA_USERDRIVER + /* TODO: Not implemented yet */ +#else + rc = plpar_hcall_7arg_7ret(H_QUERY_INT_STATE, + hcp_adapter_handle.handle, /* r4 */ + irq_info->ist, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + if ((rc != H_Success) && (rc != H_Busy)) + EDEB_ERR(4, "Could not query interrupt state."); +#endif +#endif + EDEB_EX(7, "interrupt state: %x", rc); + + return rc; +} + +static inline u64 hipz_h_register_rpage_cq(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_cq_handle + cq_handle, + struct ehca_pfcq *pfcq, + const u8 pagesize, + const u8 queue_type, + const u64 + logical_address_of_page, + const u64 count, + const struct h_galpa gal) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfcq=%p hcp_adapter_handle=%lx 
cq_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfcq, hcp_adapter_handle.handle, cq_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_register_rpage_cq(hcp_adapter_handle, cq_handle, pfcq, + pagesize, queue_type, + logical_address_of_page, count, gal); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_Parameter); + } + + retcode = + hipz_h_register_rpage(hcp_adapter_handle, pagesize, queue_type, + cq_handle.handle, logical_address_of_page, + count); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_register_rpage_qp(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, + struct ehca_pfqp *pfqp, + const u8 pagesize, + const u8 queue_type, + const u64 + logical_address_of_page, + const u64 count, + const struct h_galpa + galpa) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfqp=%p hcp_adapter_handle=%lx qp_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + pfqp, hcp_adapter_handle.handle, qp_handle.handle, pagesize, + queue_type, logical_address_of_page, count); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_register_rpage_qp(hcp_adapter_handle, + qp_handle, + pfqp, + pagesize, + queue_type, + logical_address_of_page, + count, galpa); +#else + if (count != 1) { + EDEB_ERR(4, "Page counter=%lx", count); + return (H_Parameter); + } + + retcode = hipz_h_register_rpage(hcp_adapter_handle, + pagesize, + queue_type, + qp_handle.handle, + logical_address_of_page, count); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_remove_rpt_cq(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_cq_handle + cq_handle, + struct ehca_pfcq *pfcq) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfcq=%p hcp_adapter_handle=%lx cq_handle=%lx", 
+ pfcq, hcp_adapter_handle.handle, cq_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_cq(hcp_adapter_handle, cq_handle, pfcq); +#else + /* TODO: hcall not implemented */ +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +static inline u64 hipz_h_remove_rpt_eq(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_eq_handle + eq_handle, + struct ehca_pfeq *pfeq) +{ + u64 retcode = 0; + + EDEB_EX(7, "hcp_adapter_handle=%lx eq_handle=%lx", + hcp_adapter_handle.handle, eq_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_eq(hcp_adapter_handle, eq_handle, pfeq); +#else + /* TODO: hcall not implemented */ +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +static inline u64 hipz_h_remove_rpt_qp(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, + struct ehca_pfqp *pfqp) +{ + u64 retcode = 0; + + EDEB_EN(7, "pfqp=%p hcp_adapter_handle=%lx qp_handle=%lx", + pfqp, hcp_adapter_handle.handle, qp_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = simp_h_remove_rpt_qp(hcp_adapter_handle, qp_handle, pfqp); +#else + /* TODO: hcall not implemented */ +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + +static inline u64 hipz_h_disable_and_get_wqe(const struct + ipz_adapter_handle + hcp_adapter_handle, + const struct + ipz_qp_handle qp_handle, + struct ehca_pfqp *pfqp, + void **log_addr_next_sq_wqe_tb_processed, + void **log_addr_next_rq_wqe_tb_processed, + int dis_and_get_function_code) +{ + u64 retcode = 0; + u8 function_code = 1; + u64 dummy, dummy1, dummy2; + + EDEB_EN(7, "pfqp=%p hcp_adapter_handle=%lx function=%x qp_handle=%lx", + pfqp, hcp_adapter_handle.handle, function_code, qp_handle.handle); + + if (log_addr_next_sq_wqe_tb_processed==NULL) { + log_addr_next_sq_wqe_tb_processed = (void**)&dummy1; + } + if (log_addr_next_rq_wqe_tb_processed==NULL) { + log_addr_next_rq_wqe_tb_processed = (void**)&dummy2; + } +#ifndef 
EHCA_USE_HCALL + retcode = + simp_h_disable_and_get_wqe(hcp_adapter_handle, qp_handle, pfqp, + log_addr_next_sq_wqe_tb_processed, + log_addr_next_rq_wqe_tb_processed); +#else + + retcode = plpar_hcall_7arg_7ret(H_DISABLE_AND_GETC, + hcp_adapter_handle.handle, /* r4 */ + dis_and_get_function_code, /* r5 */ + /* function code 1-disQP ret + * SQ RQ wqe ptr + * 2- ret SQ wqe ptr + * 3- ret. RQ count */ + qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + (void*)log_addr_next_sq_wqe_tb_processed, /* r4 */ + (void*)log_addr_next_rq_wqe_tb_processed, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx ladr_next_rq_wqe_out=%p" + " ladr_next_sq_wqe_out=%p", retcode, + *log_addr_next_sq_wqe_tb_processed, + *log_addr_next_rq_wqe_tb_processed); + + return retcode; +} + +enum hcall_sigt { + HCALL_SIGT_NO_CQE = 0, + HCALL_SIGT_BY_WQE = 1, + HCALL_SIGT_EVERY = 2 +}; + +static inline u64 hipz_h_modify_qp(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct ehca_pfqp *pfqp, + const u64 update_mask, + struct hcp_modify_qp_control_block + *mqpcb, + struct h_galpa gal) +{ + u64 retcode = 0; + u64 invalid_attribute_identifier = 0; + u64 rc_attrib_mask = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "pfqp=%p hcp_adapter_handle=%lx qp_handle=%lx" + " update_mask=%lx qp_state=%x mqpcb=%p", + pfqp, hcp_adapter_handle.handle, qp_handle.handle, + update_mask, mqpcb->qp_state, mqpcb); + +#ifndef EHCA_USE_HCALL + simp_h_modify_qp(hcp_adapter_handle, qp_handle, pfqp, update_mask, + mqpcb, gal); +#else + r_cb = ehca_kv_to_g(mqpcb); + retcode = plpar_hcall_7arg_7ret(H_MODIFY_QP, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + update_mask, /* r6 */ + r_cb, /* r7 */ + 0, 0, 0, + &invalid_attribute_identifier, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &dummy, /* r7 */ + &dummy, /* r8 */ + &rc_attrib_mask, /* r9 */ + &dummy); +#endif + if (retcode == H_NOT_ENOUGH_RESOURCES) 
{ + EDEB_ERR(4, "Insufficient resources retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx invalid_attribute_identifier=%lx" + " invalid_attribute_MASK=%lx", retcode, + invalid_attribute_identifier, rc_attrib_mask); + + return retcode; +} + +static inline u64 hipz_h_query_qp(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct ehca_pfqp *pfqp, + struct hcp_modify_qp_control_block + *qqpcb, struct h_galpa gal) +{ + u64 retcode = 0; + u64 dummy; + u64 r_cb; + EDEB_EN(7, "hcp_adapter_handle=%lx qp_handle=%lx", + hcp_adapter_handle.handle, qp_handle.handle); + +#ifndef EHCA_USE_HCALL + simp_h_query_qp(hcp_adapter_handle, qp_handle, qqpcb, gal); +#else + r_cb = ehca_kv_to_g(qqpcb); + EDEB(7, "r_cb=%lx", r_cb); + + retcode = plpar_hcall_7arg_7ret(H_QUERY_QP, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + r_cb, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + +#endif + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_destroy_qp(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_qp *qp) +{ + u64 retcode = 0; + u64 dummy; + u64 ladr_next_sq_wqe_out; + u64 ladr_next_rq_wqe_out; + + EDEB_EN(7, "qp = %p ,ipz_qp_handle=%lx adapter_handle=%lx", + qp, qp->ipz_qp_handle.handle, hcp_adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + retcode = + simp_h_destroy_qp(hcp_adapter_handle, qp, + qp->ehca_qp_core.galpas.user); +#else + + retcode = hcp_galpas_dtor(&qp->ehca_qp_core.galpas); + + retcode = plpar_hcall_7arg_7ret(H_DISABLE_AND_GETC, + hcp_adapter_handle.handle, /* r4 */ + /* function code */ + 1, /* r5 */ + qp->ipz_qp_handle.handle, /* r6 */ + 0, 0, 0, 0, + &ladr_next_sq_wqe_out, /* r4 */ + &ladr_next_rq_wqe_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + if (retcode == H_Hardware) { + EDEB_ERR(4, "HCA not operational. 
retcode=%lx", retcode); + } + + retcode = plpar_hcall_7arg_7ret(H_FREE_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + qp->ipz_qp_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + + if (retcode == H_Resource) { + EDEB_ERR(4, "Resource still in use. retcode=%lx", retcode); + } + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_define_aqp0(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct h_galpa gal, + u32 port) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "port=%x ipz_qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, hcp_adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + /* TODO: not implemented yet */ +#else + + retcode = plpar_hcall_7arg_7ret(H_DEFINE_AQP0, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_define_aqp1(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct h_galpa gal, + u32 port, u32 * pma_qp_nr, + u32 * bma_qp_nr) +{ + u64 retcode = 0; + u64 dummy; + u64 pma_qp_nr_out; + u64 bma_qp_nr_out; + + EDEB_EN(7, "port=%x qp_handle=%lx adapter_handle=%lx", + port, qp_handle.handle, hcp_adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + /* TODO: not implemented yet */ +#else + + retcode = plpar_hcall_7arg_7ret(H_DEFINE_AQP1, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + port, /* r6 */ + 0, 0, 0, 0, + &pma_qp_nr_out, /* r4 */ + &bma_qp_nr_out, /* r5 */ + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + *pma_qp_nr = (u32) pma_qp_nr_out; + *bma_qp_nr = (u32) bma_qp_nr_out; + +#endif + if (retcode == H_ALIAS_EXIST) { + EDEB_ERR(4, "AQP1 already exists. 
retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx pma_qp_nr=%i bma_qp_nr=%i", + retcode, (int)*pma_qp_nr, (int)*bma_qp_nr); + + return retcode; +} + +/* TODO: Don't use ib_* types in this file */ +static inline u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct h_galpa gal, + u16 mcg_dlid, union ib_gid dgid) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." + " %d.%d.%d.%d.%d.%d.%d.%d\n", + qp_handle.handle, hcp_adapter_handle.handle, + dgid.raw[0], dgid.raw[1], + dgid.raw[2], dgid.raw[3], + dgid.raw[4], dgid.raw[5], + dgid.raw[6], dgid.raw[7], + dgid.raw[0 + 8], dgid.raw[1 + 8], + dgid.raw[2 + 8], dgid.raw[3 + 8], + dgid.raw[4 + 8], dgid.raw[5 + 8], + dgid.raw[6 + 8], dgid.raw[7 + 8]); + +#ifndef EHCA_USE_HCALL + /* TODO: not implemented yet */ +#else + retcode = plpar_hcall_7arg_7ret(H_ATTACH_MCQP, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + dgid.global.interface_id, /* r7 */ + dgid.global.subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + if (retcode == H_NOT_ENOUGH_RESOURCES) { + EDEB_ERR(4, "Not enough resources. retcode=%lx", retcode); + } + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_detach_mcqp(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_qp_handle + qp_handle, struct h_galpa gal, + u16 mcg_dlid, union ib_gid dgid) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "qp_handle=%lx adapter_handle=%lx\nMCG_DGID =" + " %d.%d.%d.%d.%d.%d.%d.%d." 
+ " %d.%d.%d.%d.%d.%d.%d.%d\n", + qp_handle.handle, hcp_adapter_handle.handle, + dgid.raw[0], dgid.raw[1], + dgid.raw[2], dgid.raw[3], + dgid.raw[4], dgid.raw[5], + dgid.raw[6], dgid.raw[7], + dgid.raw[0 + 8], dgid.raw[1 + 8], + dgid.raw[2 + 8], dgid.raw[3 + 8], + dgid.raw[4 + 8], dgid.raw[5 + 8], + dgid.raw[6 + 8], dgid.raw[7 + 8]); +#ifndef EHCA_USE_HCALL + /* TODO: not implemented yet */ +#else + retcode = plpar_hcall_7arg_7ret(H_DETACH_MCQP, + hcp_adapter_handle.handle, /* r4 */ + qp_handle.handle, /* r5 */ + mcg_dlid, /* r6 */ + dgid.global.interface_id, /* r7 */ + dgid.global.subnet_prefix, /* r8 */ + 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_destroy_cq(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_cq *cq, + u8 force_flag) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "cq->pf=%p cq=.%p ipz_cq_handle=%lx adapter_handle=%lx", + &cq->pf, cq, cq->ipz_cq_handle.handle, hcp_adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + simp_h_destroy_cq(hcp_adapter_handle, cq, + cq->ehca_cq_core.galpas.kernel); +#else + retcode = hcp_galpas_dtor(&cq->ehca_cq_core.galpas); + if (retcode != 0) { + EDEB_ERR(4, "Could not destruct cp->galpas"); + return (H_Resource); + } + + retcode = plpar_hcall_7arg_7ret(H_FREE_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + cq->ipz_cq_handle.handle, /* r5 */ + force_flag!=0 ? 
1L : 0L, /* r6 */ + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif + + if (retcode == H_Resource) { + EDEB(4, "retcode=%lx ", retcode); + } + + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +static inline u64 hipz_h_destroy_eq(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_eq *eq) +{ + u64 retcode = 0; + u64 dummy; + + EDEB_EN(7, "eq->pf=%p eq=%p ipz_eq_handle=%lx adapter_handle=%lx", + &eq->pf, eq, eq->ipz_eq_handle.handle, + hcp_adapter_handle.handle); + +#ifndef EHCA_USE_HCALL + /* TODO: not implemented yet */ +#else + + retcode = hcp_galpas_dtor(&eq->galpas); + if (retcode != 0) { + EDEB_ERR(4, "Could not destruct eq->galpas"); + return (H_Resource); + } + + retcode = plpar_hcall_7arg_7ret(H_FREE_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + eq->ipz_eq_handle.handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + +#endif + if (retcode == H_Resource) { + EDEB_ERR(4, "Resource in use. retcode=%lx ", retcode); + } + EDEB_EX(7, "retcode=%lx", retcode); + + return retcode; +} + +/** + * hipz_h_alloc_resource_mr - Allocate MR resources in HW and FW, initialize + * resources.
+ * + * @pfmr: platform specific for MR + * pfshca: platform specific for SHCA + * vaddr: Memory Region I/O Virtual Address + * @length: Memory Region Length + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @mr_handle: Memory Region Handle + */ +static inline u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca + *pfshca, + const u64 vaddr, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle + *mr_handle, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_Success; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p vaddr=%lx length=%lx" + " access_ctrl=%x pd=%x pfshca=%p", + hcp_adapter_handle.handle, pfmr, vaddr, length, access_ctrl, + pd.value, pfshca); + +#ifndef EHCA_USE_HCALL + rc = simp_hcz_h_alloc_resource_mr(hcp_adapter_handle, + pfmr, + pfshca, + vaddr, + length, + access_ctrl, + pd, + (struct hcz_mrmw_handle *)mr_handle, + lkey, rkey); + EDEB_EX(7, "rc=%lx mr_handle.mrwpte=%p mr_handle.page_index=%x" + " lkey=%x rkey=%x", + rc, mr_handle->mrwpte, mr_handle->page_index, *lkey, *rkey); +#else + + rc = plpar_hcall_7arg_7ret(H_ALLOC_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + 5, /* r5 */ + vaddr, /* r6 */ + length, /* r7 */ + ((((u64) access_ctrl) << 32ULL)), /* r8 */ + pd.value, /* r9 */ + 0, + &mr_handle->handle, /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *lkey = (u32) lkey_out; + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mr_handle=%lx lkey=%x rkey=%x", + rc, mr_handle->handle, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +/** + * hipz_h_register_rpage_mr - Register MR resource page in HW and FW . 
+ * + * @pfmr: platform specific for MR + * @pfshca: platform specific for SHCA + * @queue_type: must be zero for MR + */ +static inline u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle + hcp_adapter_handle, + const struct ipz_mrmw_handle + *mr_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const u8 pagesize, + const u8 queue_type, + const u64 + logical_address_of_page, + const u64 count) +{ + u64 rc = H_Success; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x pagesize=%x queue_type=%x " + " logical_address_of_page=%lx count=%lx pfshca=%p", + hcp_adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index, pagesize, queue_type, + logical_address_of_page, count, pfshca); + + rc = simp_hcz_h_register_rpage_mr(hcp_adapter_handle, + (struct hcz_mrmw_handle *)mr_handle, + pfmr, + pfshca, + pagesize, + queue_type, + logical_address_of_page, count); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle=%lx pagesize=%x" + " queue_type=%x logical_address_of_page=%lx count=%lx", + hcp_adapter_handle.handle, pfmr, mr_handle->handle, pagesize, + queue_type, logical_address_of_page, count); + + if ((count > 1) && (logical_address_of_page & 0xfff)) { + ehca_catastrophic("ERROR: logical_address_of_page " + "not on a 4k boundary"); + rc = H_Parameter; + } else { + rc = hipz_h_register_rpage(hcp_adapter_handle, pagesize, + queue_type, mr_handle->handle, + logical_address_of_page, count); + } +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +/** + * hipz_h_query_mr - Query MR in HW and FW. 
+ * + * @pfmr: platform specific for MR + * @mr_handle: Memory Region Handle + * @mr_local_length: Local MR Length + * @mr_local_vaddr: Local MR I/O Virtual Address + * @mr_remote_length: Remote MR Length + * @mr_remote_vaddr: Remote MR I/O Virtual Address + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @lkey: L_Key + * @rkey: R_Key + */ +static inline u64 hipz_h_query_mr(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle + *mr_handle, + u64 * mr_local_length, + u64 * mr_local_vaddr, + u64 * mr_remote_length, + u64 * mr_remote_vaddr, + u32 * access_ctrl, + struct ipz_pd *pd, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_Success; + u64 dummy; + u64 acc_ctrl_pd_out; + u64 r9_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x", + hcp_adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index); + + rc = simp_hcz_h_query_mr(hcp_adapter_handle, + pfmr, + mr_handle, + mr_local_length, + mr_local_vaddr, + mr_remote_length, + mr_remote_vaddr, access_ctrl, pd, lkey, rkey); + + EDEB_EX(7, "rc=%lx mr_local_length=%lx mr_local_vaddr=%lx" + " mr_remote_length=%lx mr_remote_vaddr=%lx access_ctrl=%x" + " pd=%x lkey=%x rkey=%x", + rc, *mr_local_length, *mr_local_vaddr, *mr_remote_length, + *mr_remote_vaddr, *access_ctrl, pd->value, *lkey, *rkey); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle=%lx", + hcp_adapter_handle.handle, pfmr, mr_handle->handle); + + + rc = plpar_hcall_7arg_7ret(H_QUERY_MR, + hcp_adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + mr_local_length, /* r4 */ + mr_local_vaddr, /* r5 */ + mr_remote_length, /* r6 */ + mr_remote_vaddr, /* r7 */ + &acc_ctrl_pd_out, /* r8 */ + &r9_out, + &dummy); + + *access_ctrl = acc_ctrl_pd_out >> 32; + pd->value = (u32) acc_ctrl_pd_out; + *lkey = (u32) (r9_out >> 32); + *rkey = (u32) (r9_out & (0xffffffff)); + +
EDEB_EX(7, "rc=%lx mr_local_length=%lx mr_local_vaddr=%lx" + " mr_remote_length=%lx mr_remote_vaddr=%lx access_ctrl=%x" + " pd=%x lkey=%x rkey=%x", + rc, *mr_local_length, *mr_local_vaddr, *mr_remote_length, + *mr_remote_vaddr, *access_ctrl, pd->value, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +/** + * hipz_h_free_resource_mr - Free MR resources in HW and FW. + * + * @pfmr: platform specific for MR + * @mr_handle: Memory Region Handle + */ +static inline u64 hipz_h_free_resource_mr(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmr *pfmr, + const struct ipz_mrmw_handle + *mr_handle) +{ + u64 rc = H_Success; + u64 dummy; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle.mrwpte=%p" + " mr_handle.page_index=%x", + hcp_adapter_handle.handle, pfmr, mr_handle->mrwpte, + mr_handle->page_index); + + rc = simp_hcz_h_free_resource_mr(hcp_adapter_handle, pfmr, mr_handle); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p mr_handle=%lx", + hcp_adapter_handle.handle, pfmr, mr_handle->handle); + + rc = plpar_hcall_7arg_7ret(H_FREE_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +/** + * hipz_h_reregister_pmr - Reregister MR in HW and FW. 
+ * + * @pfmr: platform specific for MR + * @pfshca: platform specific for SHCA + * @mr_handle: Memory Region Handle + * @vaddr_in: Memory Region I/O Virtual Address + * @length: Memory Region Length + * @access_ctrl: Memory Region Access Controls + * @pd: Protection Domain + * @mr_addr_cb: Logical Address of MR Control Block + * @vaddr_out: Memory Region I/O Virtual Address + * @lkey: L_Key + * @rkey: R_Key + * + */ +static inline u64 hipz_h_reregister_pmr(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle + *mr_handle, + const u64 vaddr_in, + const u64 length, + const u32 access_ctrl, + const struct ipz_pd pd, + const u64 mr_addr_cb, + u64 * vaddr_out, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_Success; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p pfshca=%p" + " mr_handle.mrwpte=%p mr_handle.page_index=%x vaddr_in=%lx" + " length=%lx access_ctrl=%x pd=%x mr_addr_cb=%lx", + hcp_adapter_handle.handle, pfmr, pfshca, mr_handle->mrwpte, + mr_handle->page_index, vaddr_in, length, access_ctrl, + pd.value, mr_addr_cb); + + rc = simp_hcz_h_reregister_pmr(hcp_adapter_handle, pfmr, pfshca, + mr_handle, vaddr_in, length, access_ctrl, + pd, mr_addr_cb, vaddr_out, lkey, rkey); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p pfshca=%p mr_handle=%lx " + "vaddr_in=%lx length=%lx access_ctrl=%x pd=%x mr_addr_cb=%lx", + hcp_adapter_handle.handle, pfmr, pfshca, mr_handle->handle, + vaddr_in, length, access_ctrl, pd.value, mr_addr_cb); + + rc = plpar_hcall_7arg_7ret(H_REREGISTER_PMR, + hcp_adapter_handle.handle, /* r4 */ + mr_handle->handle, /* r5 */ + vaddr_in, /* r6 */ + length, /* r7 */ + /* r8 */ + ((((u64) access_ctrl) << 32ULL) | pd.value), + mr_addr_cb, /* r9 */ + 0, + &dummy, /* r4 */ + vaddr_out, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *lkey = (u32) lkey_out; + *rkey =
(u32) rkey_out; +#endif /* EHCA_USE_HCALL */ + + EDEB_EX(7, "rc=%lx vaddr_out=%lx lkey=%x rkey=%x", + rc, *vaddr_out, *lkey, *rkey); + return rc; +} + +/** @brief + as defined in carols hcall document +*/ + +/** + * Register shared MR in HW and FW. + * + * @pfmr: platform specific for new shared MR + * @orig_pfmr: platform specific for original MR + * @pfshca: platform specific for SHCA + * @orig_mr_handle: Memory Region Handle of original MR + * @vaddr_in: Memory Region I/O Virtual Address of new shared MR + * @access_ctrl: Memory Region Access Controls of new shared MR + * @pd: Protection Domain of new shared MR + * @mr_handle: Memory Region Handle of new shared MR + * @lkey: L_Key of new shared MR + * @rkey: R_Key of new shared MR + */ +static inline u64 hipz_h_register_smr(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmr *pfmr, + struct ehca_pfmr *orig_pfmr, + struct ehca_pfshca *pfshca, + const struct ipz_mrmw_handle + *orig_mr_handle, + const u64 vaddr_in, + const u32 access_ctrl, + const struct ipz_pd pd, + struct ipz_mrmw_handle + *mr_handle, + u32 * lkey, + u32 * rkey) +{ + u64 rc = H_Success; + u64 dummy; + u64 lkey_out; + u64 rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmr=%p orig_pfmr=%p pfshca=%p" + " orig_mr_handle.mrwpte=%p orig_mr_handle.page_index=%x" + " vaddr_in=%lx access_ctrl=%x pd=%x", + hcp_adapter_handle.handle, pfmr, orig_pfmr, pfshca, + orig_mr_handle->mrwpte, orig_mr_handle->page_index, + vaddr_in, access_ctrl, pd.value); + + rc = simp_hcz_h_register_smr(hcp_adapter_handle, pfmr, orig_pfmr, + pfshca, orig_mr_handle, vaddr_in, + access_ctrl, pd, + (struct hcz_mrmw_handle *)mr_handle, lkey, + rkey); + EDEB_EX(7, "rc=%lx mr_handle.mrwpte=%p mr_handle.page_index=%x" + " lkey=%x rkey=%x", + rc, mr_handle->mrwpte, mr_handle->page_index, *lkey, *rkey); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx orig_pfmr=%p pfshca=%p" + " orig_mr_handle=%lx vaddr_in=%lx access_ctrl=%x pd=%x", + 
hcp_adapter_handle.handle, orig_pfmr, pfshca, + orig_mr_handle->handle, vaddr_in, access_ctrl, pd.value); + + + rc = plpar_hcall_7arg_7ret(H_REGISTER_SMR, + hcp_adapter_handle.handle, /* r4 */ + orig_mr_handle->handle, /* r5 */ + vaddr_in, /* r6 */ + ((((u64) access_ctrl) << 32ULL)), /* r7 */ + pd.value, /* r8 */ + 0, 0, + &mr_handle->handle, /* r4 */ + &dummy, /* r5 */ + &lkey_out, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *lkey = (u32) lkey_out; + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mr_handle=%lx lkey=%x rkey=%x", + rc, mr_handle->handle, *lkey, *rkey); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +static inline u64 hipz_h_alloc_resource_mw(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmw *pfmw, + struct ehca_pfshca *pfshca, + const struct ipz_pd pd, + struct ipz_mrmw_handle *mw_handle, + u32 * rkey) +{ + u64 rc = H_Success; + u64 dummy; + u64 rkey_out; + + EDEB_EN(7, "hcp_adapter_handle=%lx pfmw=%p pd=%x pfshca=%p", + hcp_adapter_handle.handle, pfmw, pd.value, pfshca); + +#ifndef EHCA_USE_HCALL + + rc = simp_hcz_h_alloc_resource_mw(hcp_adapter_handle, pfmw, pfshca, pd, + (struct hcz_mrmw_handle *)mw_handle, + rkey); + EDEB_EX(7, "rc=%lx mw_handle.mrwpte=%p mw_handle.page_index=%x rkey=%x", + rc, mw_handle->mrwpte, mw_handle->page_index, *rkey); +#else + rc = plpar_hcall_7arg_7ret(H_ALLOC_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + 6, /* r5 */ + pd.value, /* r6 */ + 0, 0, 0, 0, + &mw_handle->handle, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &dummy, + &dummy, + &dummy); + *rkey = (u32) rkey_out; + + EDEB_EX(7, "rc=%lx mw_handle=%lx rkey=%x", + rc, mw_handle->handle, *rkey); +#endif /* EHCA_USE_HCALL */ + return rc; +} + +static inline u64 hipz_h_query_mw(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle + *mw_handle, + u32 * rkey, + struct ipz_pd *pd) +{ + u64 rc = H_Success; + u64 dummy; + u64 pd_out; + u64 
rkey_out; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmw=%p mw_handle.mrwpte=%p" + " mw_handle.page_index=%x", + hcp_adapter_handle.handle, pfmw, mw_handle->mrwpte, + mw_handle->page_index); + + rc = simp_hcz_h_query_mw(hcp_adapter_handle, pfmw, mw_handle, rkey, pd); + + EDEB_EX(7, "rc=%lx rkey=%x pd=%x", rc, *rkey, pd->value); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmw=%p mw_handle=%lx", + hcp_adapter_handle.handle, pfmw, mw_handle->handle); + + rc = plpar_hcall_7arg_7ret(H_QUERY_MW, + hcp_adapter_handle.handle, /* r4 */ + mw_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, /* r4 */ + &dummy, /* r5 */ + &dummy, /* r6 */ + &rkey_out, /* r7 */ + &pd_out, /* r8 */ + &dummy, + &dummy); + *rkey = (u32) rkey_out; + pd->value = (u32) pd_out; + + EDEB_EX(7, "rc=%lx rkey=%x pd=%x", rc, *rkey, pd->value); +#endif /* EHCA_USE_HCALL */ + + return rc; +} + +static inline u64 hipz_h_free_resource_mw(const struct ipz_adapter_handle + hcp_adapter_handle, + struct ehca_pfmw *pfmw, + const struct ipz_mrmw_handle + *mw_handle) +{ + u64 rc = H_Success; + u64 dummy; + +#ifndef EHCA_USE_HCALL + EDEB_EN(7, "hcp_adapter_handle=%lx pfmw=%p mw_handle.mrwpte=%p" + " mw_handle.page_index=%x", + hcp_adapter_handle.handle, pfmw, mw_handle->mrwpte, + mw_handle->page_index); + + rc = simp_hcz_h_free_resource_mw(hcp_adapter_handle, pfmw, mw_handle); +#else + EDEB_EN(7, "hcp_adapter_handle=%lx pfmw=%p mw_handle=%lx", + hcp_adapter_handle.handle, pfmw, mw_handle->handle); + + rc = plpar_hcall_7arg_7ret(H_FREE_RESOURCE, + hcp_adapter_handle.handle, /* r4 */ + mw_handle->handle, /* r5 */ + 0, 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); +#endif /* EHCA_USE_HCALL */ + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +static inline u64 hipz_h_error_data(const struct ipz_adapter_handle + adapter_handle, + const u64 ressource_handle, + void *rblock, + unsigned long *byte_count) +{ + u64 rc = H_Success; + u64 dummy; + u64 r_cb; + + 
EDEB_EN(7, "adapter_handle=%lx ressource_handle=%lx rblock=%p", + adapter_handle.handle, ressource_handle, rblock); + + if ((((u64)rblock) & 0xfff) != 0) { + EDEB_ERR(4, "rblock not page aligned."); + rc = H_Parameter; + return rc; + } + + r_cb = ehca_kv_to_g(rblock); + + rc = plpar_hcall_7arg_7ret(H_ERROR_DATA, + adapter_handle.handle, + ressource_handle, + r_cb, + 0, 0, 0, 0, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy, + &dummy); + + EDEB_EX(7, "rc=%lx", rc); + + return rc; +} + +#endif /* __HCP_IF_H__ */ From rolandd at cisco.com Fri Feb 17 16:57:10 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:10 -0800 Subject: [openib-general] [PATCH 03/22] pHype specific stuff In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005709.13620.77409.stgit@localhost.localdomain> From: Roland Dreier It's not clear what the connection between hcp_phyp.c and hcp_phyp.h really is -- they don't seem to be very closely related. Again, hcp_phyp.h has some rather large functions that belong in a .c file and maybe shouldn't be inlined (although maybe the generated assembly ends up being small because it's just fiddling registers around). For a change, hipz_galpa_load() and hipz_galpa_store() actually look simple enough that they could probably become inline functions in a header (and just kill hcp_phyp.c). This would also make the comments about them being inline in ehca_galpa.h true. Is ehca_galpa.h needed at all, or can it be folded into another file? Why is its abstraction needed?
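To make the suggestion about hipz_galpa_load() and hipz_galpa_store() concrete, here is a rough sketch of what the header-only variant might look like. It is not taken from the patch: it is a user-space stand-in that assumes fw_handle holds a directly dereferenceable address, uses stdint typedefs in place of the kernel's u64/u32, and adds a hypothetical galpa_selftest() helper purely for illustration. Real driver code would presumably need the platform's proper I/O accessors and ordering rules instead of a plain volatile access.

```c
/*
 * Sketch only, not from the patch: a user-space stand-in for inline
 * galpa accessors. Assumes fw_handle is a directly dereferenceable
 * address; kernel code would likely need real MMIO accessors.
 */
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

struct h_galpa {
	u64 fw_handle;	/* address where the register page is mapped */
};

/* Store a 64-bit value at byte offset 'offset' into the galpa page. */
static inline void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value)
{
	volatile u64 *reg = (volatile u64 *)(uintptr_t)(galpa.fw_handle + offset);

	*reg = value;
}

/* Load the 64-bit value at byte offset 'offset' in the galpa page. */
static inline u64 hipz_galpa_load(struct h_galpa galpa, u32 offset)
{
	volatile u64 *reg = (volatile u64 *)(uintptr_t)(galpa.fw_handle + offset);

	return *reg;
}

/* Hypothetical self-check: an ordinary array plays the register page. */
static int galpa_selftest(void)
{
	static u64 fake_page[4];
	struct h_galpa g = { .fw_handle = (u64)(uintptr_t)fake_page };

	hipz_galpa_store(g, 8, 0x1234);		/* byte offset 8 is fake_page[1] */
	assert(hipz_galpa_load(g, 8) == 0x1234);
	assert(fake_page[1] == 0x1234);
	return 0;
}
```

With both accessors this small, keeping them as static inline functions next to struct h_galpa would let the separate .c file go away, which seems to be the simplification asked about above.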
--- drivers/infiniband/hw/ehca/ehca_galpa.h | 74 +++++++ drivers/infiniband/hw/ehca/hcp_phyp.c | 81 +++++++ drivers/infiniband/hw/ehca/hcp_phyp.h | 338 +++++++++++++++++++++++++++++++ 3 files changed, 493 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_galpa.h b/drivers/infiniband/hw/ehca/ehca_galpa.h new file mode 100644 index 0000000..d64115c --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_galpa.h @@ -0,0 +1,74 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * pSeries interface definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_galpa.h,v 1.6 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_GALPA_H__ +#define __EHCA_GALPA_H__ + +/* eHCA page (mapped into p-memory) + resource to access eHCA register pages in CPU address space +*/ +struct h_galpa { + u64 fw_handle; + /* for pSeries this is a 64bit memory address where + I/O memory is mapped into CPU address space (kv) */ +}; + +/** + resource to access eHCA address space registers, all types +*/ +struct h_galpas { + u32 pid; /*PID of userspace galpa checking */ + struct h_galpa user; /* user space accessible resource, + set to 0 if unused */ + struct h_galpa kernel; /* kernel space accessible resource, + set to 0 if unused */ +}; +/** @brief store value at offset into galpa, will be inline function + */ +void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value); + +/** @brief return value from offset in galpa, will be inline function + */ +u64 hipz_galpa_load(struct h_galpa galpa, u32 offset); + +#endif /* __EHCA_GALPA_H__ */ diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c new file mode 100644 index 0000000..129e61b --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_phyp.c @@ -0,0 +1,81 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * load store abstraction for ehca register access + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. 
+ * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hcp_phyp.c,v 1.10 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#define DEB_PREFIX "PHYP" + +#ifdef __KERNEL__ +#include "ehca_kernel.h" +#include "hipz_hw.h" +/* #include "hipz_structs.h" */ +/* TODO: still necessary */ +#include "ehca_classes.h" +#else /* !__KERNEL__ */ +#include "ehca_utools.h" +#include "ehca_galpa.h" +#endif + +#ifndef EHCA_USERDRIVER /* TODO: is this correct */ + +u64 hipz_galpa_load(struct h_galpa galpa, u32 offset) +{ + u64 addr = galpa.fw_handle + offset; + u64 out; + EDEB_EN(7, "addr=%lx offset=%x ", addr, offset); + out = *(u64 *) addr; + EDEB_EX(7, "addr=%lx value=%lx", addr, out); + return out; +}; + +void hipz_galpa_store(struct h_galpa galpa, u32 offset, u64 value) +{ + u64 addr = galpa.fw_handle + offset; + EDEB(7, "addr=%lx offset=%x value=%lx", addr, + offset, value); + *(u64 *) addr = value; +#ifdef EHCA_USE_HCALL + /* hipz_galpa_load(galpa, offset); */ + /* synchronize explicitly */ +#endif +}; + +#endif /* EHCA_USERDRIVER */ diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.h b/drivers/infiniband/hw/ehca/hcp_phyp.h new file mode 100644 index 0000000..c82fb4b --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_phyp.h @@ -0,0 +1,338 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Firmware calls + * + * Authors: Christoph Raisch + * Waleri Fomin + * Gerd Bayer + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hcp_phyp.h,v 1.16 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HCP_PHYP_H__ +#define __HCP_PHYP_H__ + +#ifndef EHCA_USERDRIVER +inline static int hcall_map_page(u64 physaddr, u64 * mapaddr) +{ + *mapaddr = (u64)(ioremap(physaddr, 4096)); + + EDEB(7, "ioremap physaddr=%lx mapaddr=%lx", physaddr, *mapaddr); + return 0; +} + +inline static int hcall_unmap_page(u64 mapaddr) +{ + EDEB(7, "mapaddr=%lx", mapaddr); + iounmap((void *)(mapaddr)); + return 0; +} +#else +int hcall_map_page(u64 physaddr, u64 * mapaddr); +int hcall_unmap_page(u64 mapaddr); +#endif + +struct hcall { + u64 regs[11]; +}; + +/** + * @brief returns time to wait in secs for the given long busy error code + */ +inline static u32 getLongBusyTimeSecs(int longBusyRetCode) +{ + switch (longBusyRetCode) { + case H_LongBusyOrder1msec: + return 1; + case H_LongBusyOrder10msec: + return 10; + case H_LongBusyOrder100msec: + return 100; + case H_LongBusyOrder1sec: + return 1000; + case H_LongBusyOrder10sec: + return 10000; + case H_LongBusyOrder100sec: + return 100000; + default: + return 1; + } /* eof switch */ +} + +inline static long plpar_hcall_7arg_7ret(unsigned long opcode, + unsigned long arg1, /* References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005712.13620.82908.stgit@localhost.localdomain> From: Roland Dreier hipz_probe_adapters() looks a little funny -- it seems to bail out of all the remaining adapters if one of them isn't quite right. 
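If the intent is to keep probing after hitting a node without a loc-code (an assumption; the patch may bail out on purpose), the loop could skip the bad node and carry on. The sketch below only shows that control flow. struct fake_node and probe_adapters_skip_bad() are hypothetical names made up for the example, and a plain array replaces the of_find_node_by_name()/get_property() walk so that it compiles outside the kernel.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a device tree node. loc_code == NULL models
 * an lhca node that has no "ibm,loc-code" property.
 */
struct fake_node {
	const char *loc_code;
};

/*
 * Store the loc-code of every usable node in adapter_list and return how
 * many were stored. Nodes without a loc-code are counted in *skipped and
 * passed over instead of ending the scan. The scan stops early only when
 * adapter_list is full.
 */
static int probe_adapters_skip_bad(const struct fake_node *nodes,
				   size_t n_nodes,
				   const char **adapter_list,
				   size_t list_len,
				   int *skipped)
{
	int num = 0;
	size_t i;

	*skipped = 0;
	for (i = 0; i < n_nodes && (size_t)num < list_len; i++) {
		if (nodes[i].loc_code == NULL) {
			(*skipped)++;	/* note the bad node, keep going */
			continue;
		}
		adapter_list[num++] = nodes[i].loc_code;
	}

	return num;
}
```

Returning the count of adapters found and reporting skipped nodes separately lets the caller decide whether a partially usable system is an error. The posted hipz_probe_adapters() returns -ENODEV on the first bad node, so this would be a change in behaviour rather than a cleanup.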
--- drivers/infiniband/hw/ehca/hcp_sense.c | 144 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/hcp_sense.h | 136 ++++++++++++++++++++++++++++++ 2 files changed, 280 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_sense.c b/drivers/infiniband/hw/ehca/hcp_sense.c new file mode 100644 index 0000000..83fa4a3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_sense.c @@ -0,0 +1,144 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * ehca detection and query code for POWER + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hcp_sense.c,v 1.10 2006/02/06 10:17:34 schickhj Exp $ + */ + +#define DEB_PREFIX "snse" + +#include "ehca_kernel.h" +#include "ehca_tools.h" + +int hipz_count_adapters(void) +{ + int num = 0; + struct device_node *dn = NULL; + + EDEB_EN(7, ""); + + while ((dn = of_find_node_by_name(dn, "lhca"))) { + num++; + } + + of_node_put(dn); + + if (num == 0) { + EDEB_ERR(4, "No lhca node name was found in the" + " Open Firmware device tree."); + return -ENODEV; + } + + EDEB(6, " ... found %x adapter(s)", num); + + EDEB_EX(7, "num=%x", num); + + return num; +} + +int hipz_probe_adapters(char **adapter_list) +{ + int ret = 0; + int num = 0; + struct device_node *dn = NULL; + char *loc; + + EDEB_EN(7, "adapter_list=%p", adapter_list); + + while ((dn = of_find_node_by_name(dn, "lhca"))) { + loc = get_property(dn, "ibm,loc-code", NULL); + if (loc == NULL) { + EDEB_ERR(4, "No ibm,loc-code property for" + " lhca Open Firmware device tree node."); + ret = -ENODEV; + goto probe_adapters0; + } + + adapter_list[num] = loc; + EDEB(6, " ... 
found adapter[%x] with loc-code: %s", num, loc); + num++; + } + + probe_adapters0: + of_node_put(dn); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +u64 hipz_get_adapter_handle(char *adapter) +{ + struct device_node *dn = NULL; + char *loc; + u64 *u64data = NULL; + u64 ret = 0; + + EDEB_EN(7, "adapter=%p", adapter); + + while ((dn = of_find_node_by_name(dn, "lhca"))) { + loc = get_property(dn, "ibm,loc-code", NULL); + if (loc == NULL) { + EDEB_ERR(4, "No ibm,loc-code property for" + " lhca Open Firmware device tree node."); + goto get_adapter_handle0; + } + + if (strcmp(loc, adapter) == 0) { + u64data = + (u64 *) get_property(dn, "ibm,hca-handle", NULL); + break; + } + } + + if (u64data == NULL) { + EDEB_ERR(4, "No ibm,hca-handle property for" + " lhca Open Firmware device tree node with" + " ibm,loc-code: %s.", adapter); + goto get_adapter_handle0; + } + + ret = *u64data; + + get_adapter_handle0: + of_node_put(dn); + + EDEB_EX(7, "ret=%lx",ret); + + return ret; +} diff --git a/drivers/infiniband/hw/ehca/hcp_sense.h b/drivers/infiniband/hw/ehca/hcp_sense.h new file mode 100644 index 0000000..a49040b --- /dev/null +++ b/drivers/infiniband/hw/ehca/hcp_sense.h @@ -0,0 +1,136 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * ehca detection and query code for POWER + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hcp_sense.h,v 1.11 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef HCP_SENSE_H +#define HCP_SENSE_H + +int hipz_count_adapters(void); +int hipz_probe_adapters(char **adapter_list); +u64 hipz_get_adapter_handle(char *adapter); + +/* query hca response block */ +struct query_hca_rblock { + u32 cur_reliable_dg; + u32 cur_qp; + u32 cur_cq; + u32 cur_eq; + u32 cur_mr; + u32 cur_mw; + u32 cur_ee_context; + u32 cur_mcast_grp; + u32 cur_qp_attached_mcast_grp; + u32 reserved1; + u32 cur_ipv6_qp; + u32 cur_eth_qp; + u32 cur_hp_mr; + u32 reserved2[3]; + u32 max_rd_domain; + u32 max_qp; + u32 max_cq; + u32 max_eq; + u32 max_mr; + u32 max_hp_mr; + u32 max_mw; + u32 max_mrwpte; + u32 max_special_mrwpte; + u32 max_rd_ee_context; + u32 max_mcast_grp; + u32 max_qps_attached_all_mcast_grp; + u32 max_qps_attached_mcast_grp; + u32 max_raw_ipv6_qp; + u32 max_raw_ethy_qp; + u32 internal_clock_frequency; + u32 max_pd; + u32 max_ah; + u32 max_cqe; + u32 max_wqes_wq; + u32 max_partitions; + u32 max_rr_ee_context; + u32 max_rr_qp; + u32 max_rr_hca; + u32 max_act_wqs_ee_context; + u32 max_act_wqs_qp; + u32 max_sge; + u32 max_sge_rd; + u32 memory_page_size_supported; + u64 max_mr_size; + u32 local_ca_ack_delay; + u32 num_ports; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + u64 node_guid; + u64 hca_cap_indicators; + u32 data_counter_register_size; + u32 max_shared_rq; + u32 max_isns_eq; + u32 max_neq; +} __attribute__ ((packed)); + +/* query port response block */ +struct query_port_rblock { + u32 state; + u32 bad_pkey_cntr; + u32 lmc; + u32 lid; + u32 subnet_timeout; + u32 qkey_viol_cntr; + u32 sm_sl; + u32 sm_lid; + u32 capability_mask; + u32 init_type_reply; + u32 pkey_tbl_len; + u32 gid_tbl_len; + u64 gid_prefix; + u32 port_nr; + u16 pkey_entries[16]; + u8 reserved1[32]; + u32 trent_size; + u32 trbuf_size; + u64 max_msg_sz; + u32 max_mtu; + u32 vl_cap; + u8 reserved2[1900]; + u64 guid_entries[255]; +} __attribute__ ((packed)); + +#endif From rolandd at cisco.com 
Fri Feb 17 16:57:21 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:21 -0800 Subject: [openib-general] [PATCH 07/22] Hypercall definitions In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005721.13620.84990.stgit@localhost.localdomain> From: Roland Dreier Do these defines belong in the ehca driver, or should they be put somewhere in generic hypercall support? --- drivers/infiniband/hw/ehca/ehca_common.h | 115 ++++++++++++++++++++++++++++++ 1 files changed, 115 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_common.h b/drivers/infiniband/hw/ehca/ehca_common.h new file mode 100644 index 0000000..922f010 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_common.h @@ -0,0 +1,115 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * hcad local defines + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_common.h,v 1.15 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_COMMON_H__ +#define __EHCA_COMMON_H__ + +#ifdef CONFIG_PPC64 +#include + +#define H_PARTIAL_STORE 16 +#define H_PAGE_REGISTERED 15 +#define H_IN_PROGRESS 14 +#define H_PARTIAL 5 +#define H_NOT_AVAILABLE 3 +#define H_Closed 2 +#define H_ADAPTER_PARM -17 +#define H_RH_PARM -18 +#define H_RCQ_PARM -19 +#define H_SCQ_PARM -20 +#define H_EQ_PARM -21 +#define H_RT_PARM -22 +#define H_ST_PARM -23 +#define H_SIGT_PARM -24 +#define H_TOKEN_PARM -25 +#define H_MLENGTH_PARM -27 +#define H_MEM_PARM -28 +#define H_MEM_ACCESS_PARM -29 +#define H_ATTR_PARM -30 +#define H_PORT_PARM -31 +#define H_MCG_PARM -32 +#define H_VL_PARM -33 +#define H_TSIZE_PARM -34 +#define H_TRACE_PARM -35 + +#define H_MASK_PARM -37 +#define H_MCG_FULL -38 +#define H_ALIAS_EXIST -39 +#define H_P_COUNTER -40 +#define H_TABLE_FULL -41 +#define H_ALT_TABLE -42 +#define H_MR_CONDITION -43 +#define H_NOT_ENOUGH_RESOURCES -44 +#define H_R_STATE -45 +#define H_RESCINDEND -46 + +/* H call defines to be moved to kernel */ +#define H_RESET_EVENTS 0x15C +#define H_ALLOC_RESOURCE 0x160 +#define H_FREE_RESOURCE 0x164 +#define H_MODIFY_QP 0x168 +#define H_QUERY_QP 0x16C +#define H_REREGISTER_PMR 0x170 +#define H_REGISTER_SMR 0x174 +#define H_QUERY_MR 0x178 +#define H_QUERY_MW 0x17C +#define H_QUERY_HCA 0x180 +#define H_QUERY_PORT 0x184 +#define H_MODIFY_PORT 0x188 +#define H_DEFINE_AQP1 0x18C 
+#define H_GET_TRACE_BUFFER 0x190 +#define H_DEFINE_AQP0 0x194 +#define H_RESIZE_MR 0x198 +#define H_ATTACH_MCQP 0x19C +#define H_DETACH_MCQP 0x1A0 +#define H_CREATE_RPT 0x1A4 +#define H_REMOVE_RPT 0x1A8 +#define H_REGISTER_RPAGES 0x1AC +#define H_DISABLE_AND_GETC 0x1B0 +#define H_ERROR_DATA 0x1B4 +#define H_GET_HCA_INFO 0x1B8 +#define H_GET_PERF_COUNT 0x1BC +#define H_MANAGE_TRACE 0x1C0 +#define H_QUERY_INT_STATE 0x1E4 +#endif + +#endif /* __EHCA_COMMON_H__ */ From rolandd at cisco.com Fri Feb 17 16:57:17 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:17 -0800 Subject: [openib-general] [PATCH 05/22] HW register abstractions In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005717.13620.85161.stgit@localhost.localdomain> From: Roland Dreier Does hipz_structs.h really need a whole file to hold 5 #defines? --- drivers/infiniband/hw/ehca/hipz_fns.h | 83 ++++++ drivers/infiniband/hw/ehca/hipz_fns_core.h | 123 +++++++++ drivers/infiniband/hw/ehca/hipz_hw.h | 382 ++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/hipz_structs.h | 54 ++++ 4 files changed, 642 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hipz_fns.h b/drivers/infiniband/hw/ehca/hipz_fns.h new file mode 100644 index 0000000..4231b65 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_fns.h @@ -0,0 +1,83 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Reinhard Ernst + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hipz_fns.h,v 1.15 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HIPZ_FNS_H__ +#define __HIPZ_FNS_H__ + +#include "hipz_structs.h" +#include "ehca_classes.h" +#include "hipz_hw.h" +#ifndef EHCA_USE_HCALL +#include "sim_gal.h" +#endif + +#include "hipz_fns_core.h" + +#define hipz_galpa_store_eq(gal,offset,value)\ + hipz_galpa_store(gal,EQTEMM_OFFSET(offset),value) +#define hipz_galpa_load_eq(gal,offset)\ + hipz_galpa_load(gal,EQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qped(gal,offset,value)\ + hipz_galpa_store(gal,QPEDMM_OFFSET(offset),value) +#define hipz_galpa_load_qped(gal,offset)\ + hipz_galpa_load(gal,QPEDMM_OFFSET(offset)) + +#define hipz_galpa_store_mrmw(gal,offset,value)\ + hipz_galpa_store(gal,MRMWMM_OFFSET(offset),value) +#define hipz_galpa_load_mrmw(gal,offset)\ + hipz_galpa_load(gal,MRMWMM_OFFSET(offset)) + +inline static void hipz_load_FEC(struct ehca_cq_core *cq_core, u32 * count) +{ + uint64_t reg = 0; + EDEB_EN(7, "cq_core=%p", cq_core); + { + struct h_galpa gal = cq_core->galpas.kernel; + reg = hipz_galpa_load_cq(gal, CQx_FEC); + *count = EHCA_BMASK_GET(CQx_FEC_CQE_cnt, reg); + } + EDEB_EX(7,"cq_core=%p CQx_FEC=%lx", cq_core,reg); +} + +#endif /* __IPZ_IF_H__ */ diff --git a/drivers/infiniband/hw/ehca/hipz_fns_core.h b/drivers/infiniband/hw/ehca/hipz_fns_core.h new file mode 100644 index 0000000..a60b808 --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_fns_core.h @@ -0,0 +1,123 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HW abstraction register functions + * + * Authors: Christoph Raisch + * Reinhard Ernst + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: hipz_fns_core.h,v 1.10 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HIPZ_FNS_CORE_H__ +#define __HIPZ_FNS_CORE_H__ + +#include "ehca_galpa.h" +#include "hipz_hw.h" + +#define hipz_galpa_store_cq(gal,offset,value)\ + hipz_galpa_store(gal,CQTEMM_OFFSET(offset),value) +#define hipz_galpa_load_cq(gal,offset)\ + hipz_galpa_load(gal,CQTEMM_OFFSET(offset)) + +#define hipz_galpa_store_qp(gal,offset,value)\ + hipz_galpa_store(gal,QPTEMM_OFFSET(offset),value) +#define hipz_galpa_load_qp(gal,offset)\ + hipz_galpa_load(gal,QPTEMM_OFFSET(offset)) + +inline static void hipz_update_SQA(struct ehca_qp_core *qp_core, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp_core=%p", qp_core); + gal = qp_core->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, QPx_SQA, EHCA_BMASK_SET(QPx_SQAdder, nr_wqes)); + EDEB_EX(7, "qp_core=%p QPx_SQA = %i", qp_core, nr_wqes); +} + +inline static void hipz_update_RQA(struct ehca_qp_core *qp_core, u16 nr_wqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "qp_core=%p", qp_core); + gal = qp_core->galpas.kernel; + /* ringing doorbell :-) */ + hipz_galpa_store_qp(gal, QPx_RQA, EHCA_BMASK_SET(QPx_RQAdder, nr_wqes)); + EDEB_EX(7, "qp_core=%p QPx_RQA = %i", qp_core, nr_wqes); +} + +inline static void hipz_update_FECA(struct ehca_cq_core *cq_core, u32 nr_cqes) +{ + struct h_galpa gal; + + EDEB_EN(7, "cq_core=%p", cq_core); + gal = cq_core->galpas.kernel; + hipz_galpa_store_cq(gal, CQx_FECA, + EHCA_BMASK_SET(CQx_FECAdder, nr_cqes)); + EDEB_EX(7, "cq_core=%p CQx_FECA = %i", cq_core, nr_cqes); +} + +inline static void hipz_set_CQx_N0(struct ehca_cq_core *cq_core, u32 value) +{ + struct h_galpa gal; + u64 CQx_N0_reg = 0; + + EDEB_EN(7, "cq_core=%p event on solicited completion -- write CQx_N0", + cq_core); + gal = cq_core->galpas.kernel; + hipz_galpa_store_cq(gal, CQx_N0, + EHCA_BMASK_SET(CQx_N0_generate_solicited_comp_event, + value)); + CQx_N0_reg = hipz_galpa_load_cq(gal, CQx_N0); + EDEB_EX(7, "cq_core=%p loaded 
CQx_N0=%lx", cq_core,(unsigned long)CQx_N0_reg); +} + +inline static void hipz_set_CQx_N1(struct ehca_cq_core *cq_core, u32 value) +{ + struct h_galpa gal; + u64 CQx_N1_reg = 0; + + EDEB_EN(7, "cq_core=%p event on completion -- write CQx_N1", + cq_core); + gal = cq_core->galpas.kernel; + hipz_galpa_store_cq(gal, CQx_N1, + EHCA_BMASK_SET(CQx_N1_generate_comp_event, value)); + CQx_N1_reg = hipz_galpa_load_cq(gal, CQx_N1); + EDEB_EX(7, "cq_core=%p loaded CQx_N1=%lx", cq_core,(unsigned long)CQx_N1_reg); +} + +#endif /* __HIPZ_FNC_CORE_H__ */ diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h new file mode 100644 index 0000000..6fa005b --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -0,0 +1,382 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * eHCA register definitions + * + * Authors: Christoph Raisch + * Reinhard Ernst + * Waleri Fomin + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hipz_hw.h,v 1.7 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HIPZ_HW_H__ +#define __HIPZ_HW_H__ + +#ifdef __KERNEL__ +#include "ehca_tools.h" +#include "ehca_kernel.h" +#else /* !__KERNEL__ */ +#include "ehca_utools.h" +#endif + +/** @brief Queue Pair Table Memory + */ +struct hipz_QPTEMM { + u64 QPx_HCR; +#define QPx_HCR_PKEY_Mode EHCA_BMASK_IBM(1,2) +#define QPx_HCR_Special_QP_Mode EHCA_BMASK_IBM(6,7) + u64 QPx_C; +#define QPx_C_Enabled EHCA_BMASK_IBM(0,0) +#define QPx_C_Disabled EHCA_BMASK_IBM(1,1) +#define QPx_C_Req_State EHCA_BMASK_IBM(16,23) +#define QPx_C_Res_State EHCA_BMASK_IBM(25,31) +#define QPx_C_disable_ETE_check EHCA_BMASK_IBM(7,7) + u64 QPx_HERR; + u64 QPx_AER; +/* 0x20*/ + u64 QPx_SQA; +#define QPx_SQAdder EHCA_BMASK_IBM(48,63) + u64 QPx_SQC; + u64 QPx_RQA; +#define QPx_RQAdder EHCA_BMASK_IBM(48,63) + u64 QPx_RQC; +/* 0x40*/ + u64 QPx_ST; + u64 QPx_PMSTATE; +#define QPx_PMSTATE_BITS EHCA_BMASK_IBM(30,31) + u64 QPx_PMFA; + u64 QPx_PKEY; +#define QPx_PKEY_value EHCA_BMASK_IBM(48,63) +/* 0x60*/ + u64 QPx_PKEYA; +#define QPx_PKEYA_index0 EHCA_BMASK_IBM(0,15) +#define QPx_PKEYA_index1 EHCA_BMASK_IBM(16,31) +#define QPx_PKEYA_index2 EHCA_BMASK_IBM(32,47) +#define QPx_PKEYA_index3 EHCA_BMASK_IBM(48,63) + u64 QPx_PKEYB; +#define QPx_PKEYB_index4 EHCA_BMASK_IBM(0,15) +#define QPx_PKEYB_index5 EHCA_BMASK_IBM(16,31) +#define QPx_PKEYB_index6 EHCA_BMASK_IBM(32,47) +#define QPx_PKEYB_index7 
EHCA_BMASK_IBM(48,63) + u64 QPx_PKEYC; +#define QPx_PKEYC_index8 EHCA_BMASK_IBM(0,15) +#define QPx_PKEYC_index9 EHCA_BMASK_IBM(16,31) +#define QPx_PKEYC_index10 EHCA_BMASK_IBM(32,47) +#define QPx_PKEYC_index11 EHCA_BMASK_IBM(48,63) + u64 QPx_PKEYD; +#define QPx_PKEYD_index12 EHCA_BMASK_IBM(0,15) +#define QPx_PKEYD_index13 EHCA_BMASK_IBM(16,31) +#define QPx_PKEYD_index14 EHCA_BMASK_IBM(32,47) +#define QPx_PKEYD_index15 EHCA_BMASK_IBM(48,63) +/* 0x80*/ + u64 QPx_QKEY; +#define QPx_QKEY_value EHCA_BMASK_IBM(32,63) + u64 QPx_DQP; +#define QPx_DQP_number EHCA_BMASK_IBM(40,63) + u64 QPx_DLIDP; +#define QPx_DLID_PRIMARY EHCA_BMASK_IBM(48,63) +#define QPx_DLIDP_GRH EHCA_BMASK_IBM(31,31) + u64 QPx_PORTP; +#define QPx_PORT_Primary EHCA_BMASK_IBM(57,63) +/* 0xa0*/ + u64 QPx_SLIDP; +#define QPx_SLIDP_p_path EHCA_BMASK_IBM(48,63) +#define QPx_SLIDP_lmc EHCA_BMASK_IBM(37,39) + u64 QPx_SLIDPP; +#define QPx_SLID_PRIM_PATH EHCA_BMASK_IBM(57,63) + u64 QPx_DLIDA; +#define QPx_DLIDA_GRH EHCA_BMASK_IBM(31,31) + u64 QPx_PORTA; +#define QPx_PORT_Alternate EHCA_BMASK_IBM(57,63) +/* 0xc0*/ + u64 QPx_SLIDA; + u64 QPx_SLIDPA; + u64 QPx_SLVL; +#define QPx_SLVL_BITS EHCA_BMASK_IBM(56,59) +#define QPx_SLVL_VL EHCA_BMASK_IBM(60,63) + u64 QPx_IPD; +#define QPx_IPD_max_static_rate EHCA_BMASK_IBM(56,63) +/* 0xe0*/ + u64 QPx_MTU; +#define QPx_MTU_size EHCA_BMASK_IBM(56,63) + u64 QPx_LATO; +#define QPx_LATO_BITS EHCA_BMASK_IBM(59,63) + u64 QPx_RLIMIT; +#define QPx_RETRY_COUNT EHCA_BMASK_IBM(61,63) + u64 QPx_RNRLIMIT; +#define QPx_RNR_RETRY_COUNT EHCA_BMASK_IBM(61,63) +/* 0x100*/ + u64 QPx_T; + u64 QPx_SQHP; + u64 QPx_SQPTP; + u64 QPx_NSPSN; +#define QPx_NSPSN_value EHCA_BMASK_IBM(40,63) +/* 0x120*/ + u64 QPx_NSPSNHWM; +#define QPx_NSPSNHWM_value EHCA_BMASK_IBM(40,63) + u64 reserved1; + u64 QPx_SDSI; + u64 QPx_SDSBC; +/* 0x140*/ + u64 QPx_SQWSIZE; +#define QPx_SQWSIZE_value EHCA_BMASK_IBM(61,63) + u64 QPx_SQWTS; + u64 QPx_LSN; + u64 QPx_NSSN; +/* 0x160 */ + u64 QPx_MOR; +#define QPx_MOR_value 
EHCA_BMASK_IBM(48,63) + u64 QPx_COR; + u64 QPx_SQSIZE; +#define QPx_SQSIZE_value EHCA_BMASK_IBM(60,63) + u64 QPx_ERC; +/* 0x180*/ + u64 QPx_RNRRC; +#define QPx_RNRRESP_value EHCA_BMASK_IBM(59,63) + u64 QPx_ERNRWT; + u64 QPx_RNRRESP; +#define QPx_RNRRESP_WTR EHCA_BMASK_IBM(59,63) + u64 QPx_LMSNA; +/* 0x1a0 */ + u64 QPx_SQHPC; + u64 QPx_SQCPTP; + u64 QPx_SIGT; + u64 QPx_WQECNT; +/* 0x1c0*/ + + u64 QPx_RQHP; + u64 QPx_RQPTP; + u64 QPx_RQSIZE; +#define QPx_RQSIZE_value EHCA_BMASK_IBM(60,63) + u64 QPx_NRR; +#define QPx_NRR_value EHCA_BMASK_IBM(61,63) +/* 0x1e0*/ + u64 QPx_RDMAC; +#define QPx_RDMAC_value EHCA_BMASK_IBM(61,63) + u64 QPx_NRPSN; +#define QPx_NRPSN_value EHCA_BMASK_IBM(40,63) + u64 QPx_LAPSN; +#define QPx_LAPSN_value EHCA_BMASK_IBM(40,63) + u64 QPx_LCR; +/* 0x200*/ + u64 QPx_RWC; + u64 QPx_RWVA; + u64 QPx_RDSI; + u64 QPx_RDSBC; +/* 0x220*/ + u64 QPx_RQWSIZE; +#define QPx_RQWSIZE_value EHCA_BMASK_IBM(61,63) + u64 QPx_CRMSN; + u64 QPx_RDD; +#define QPx_RDD_VALUE EHCA_BMASK_IBM(32,63) + u64 QPx_LARPSN; +#define QPx_LARPSN_value EHCA_BMASK_IBM(40,63) +/* 0x240*/ + u64 QPx_PD; + u64 QPx_SCQN; + u64 QPx_RCQN; + u64 QPx_AEQN; +/* 0x260*/ + u64 QPx_AAELOG; + u64 QPx_RAM; + u64 QPx_RDMAQE0; + u64 QPx_RDMAQE1; +/* 0x280*/ + u64 QPx_RDMAQE2; + u64 QPx_RDMAQE3; + u64 QPx_NRPSNHWM; +#define QPx_NRPSNHWM_value EHCA_BMASK_IBM(40,63) +/* 0x298*/ + u64 reserved[(0x400 - 0x298) / 8]; +/* 0x400 extended data */ + u64 reserved_ext[(0x500 - 0x400) / 8]; +/* 0x500 */ + u64 reserved2[(0x1000 - 0x500) / 8]; +/* 0x1000 */ +}; + +#define QPTEMM_OFFSET(x) offsetof(struct hipz_QPTEMM,x) + +/** @brief MRMWPT Entry Memory Map + */ +struct hipz_MRMWMM { + /* 0x00 */ + u64 MRx_HCR; +#define MRx_HCR_LPARID_VALID EHCA_BMASK_IBM(0,0) + + u64 MRx_C; + u64 MRx_HERR; + u64 MRx_AER; + /* 0x20 */ + u64 MRx_PP; + u64 reserved1; + u64 reserved2; + u64 reserved3; + /* 0x40 */ + u64 reserved4[(0x200 - 0x40) / 8]; + /* 0x200 */ + u64 MRx_CTL[64]; + +}; + +#define MRMWMM_OFFSET(x) offsetof(struct 
hipz_MRMWMM,x) + +/** @brief QPEDMM + */ +struct hipz_QPEDMM { + /* 0x00 */ + u64 reserved0[(0x400) / 8]; + /* 0x400 */ + u64 QPEDx_PHH; +#define QPEDx_PHH_TClass EHCA_BMASK_IBM(4,11) +#define QPEDx_PHH_HopLimit EHCA_BMASK_IBM(56,63) +#define QPEDx_PHH_FlowLevel EHCA_BMASK_IBM(12,31) + u64 QPEDx_PPSGP; +#define QPEDx_PPSGP_PPPidx EHCA_BMASK_IBM(0,63) + /* 0x410 */ + u64 QPEDx_PPSGU; +#define QPEDx_PPSGU_PPPSGID EHCA_BMASK_IBM(0,63) + u64 QPEDx_PPDGP; + /* 0x420 */ + u64 QPEDx_PPDGU; + u64 QPEDx_APH; + /* 0x430 */ + u64 QPEDx_APSGP; + u64 QPEDx_APSGU; + /* 0x440 */ + u64 QPEDx_APDGP; + u64 QPEDx_APDGU; + /* 0x450 */ + u64 QPEDx_APAV; + u64 QPEDx_APSAV; + /* 0x460 */ + u64 QPEDx_HCR; + u64 reserved1[4]; + /* 0x488 */ + u64 QPEDx_RRL0; + /* 0x490 */ + u64 QPEDx_RRRKEY0; + u64 QPEDx_RRVA0; + /* 0x4A0 */ + u64 reserved2; + u64 QPEDx_RRL1; + /* 0x4B0 */ + u64 QPEDx_RRRKEY1; + u64 QPEDx_RRVA1; + /* 0x4C0 */ + u64 reserved3; + u64 QPEDx_RRL2; + /* 0x4D0 */ + u64 QPEDx_RRRKEY2; + u64 QPEDx_RRVA2; + /* 0x4E0 */ + u64 reserved4; + u64 QPEDx_RRL3; + /* 0x4F0 */ + u64 QPEDx_RRRKEY3; + u64 QPEDx_RRVA3; +}; + +#define QPEDMM_OFFSET(x) offsetof(struct hipz_QPEDMM,x) + +/** @brief CQ Table Entry Memory Map + */ +struct hipz_CQTEMM { + u64 CQx_HCR; +#define CQx_HCR_LPARID_valid EHCA_BMASK_IBM(0,0) + u64 CQx_C; +#define CQx_C_Enable EHCA_BMASK_IBM(0,0) +#define CQx_C_Disable_Complete EHCA_BMASK_IBM(1,1) +#define CQx_C_Error_Reset EHCA_BMASK_IBM(23,23) + u64 CQx_HERR; + u64 CQx_AER; +/* 0x20 */ + u64 CQx_PTP; + u64 CQx_TP; +#define CQx_FEC_CQE_cnt EHCA_BMASK_IBM(32,63) + u64 CQx_FEC; + u64 CQx_FECA; +#define CQx_FECAdder EHCA_BMASK_IBM(32,63) +/* 0x40 */ + u64 CQx_EP; +#define CQx_EP_Event_Pending EHCA_BMASK_IBM(0,0) +#define CQx_EQ_number EHCA_BMASK_IBM(0,15) +#define CQx_EQ_CQtoken EHCA_BMASK_IBM(32,63) + u64 CQx_EQ; +/* 0x50 */ + u64 reserved1; + u64 CQx_N0; +#define CQx_N0_generate_solicited_comp_event EHCA_BMASK_IBM(0,0) +/* 0x60 */ + u64 CQx_N1; +#define 
CQx_N1_generate_comp_event EHCA_BMASK_IBM(0,0) + u64 reserved2[(0x1000 - 0x60) / 8]; +/* 0x1000 */ +}; + +#define CQTEMM_OFFSET(x) offsetof(struct hipz_CQTEMM,x) + +/** @brief EQ Table Entry Memory Map + */ +struct hipz_EQTEMM { + u64 EQx_HCR; +#define EQx_HCR_LPARID_valid EHCA_BMASK_IBM(0,0) +#define EQx_HCR_ENABLE_PSB EHCA_BMASK_IBM(8,8) + u64 EQx_C; +#define EQx_C_Enable EHCA_BMASK_IBM(0,0) +#define EQx_C_Error_Reset EHCA_BMASK_IBM(23,23) +#define EQx_C_Comp_Event EHCA_BMASK_IBM(17,17) + + u64 EQx_HERR; + u64 EQx_AER; +/* 0x20 */ + u64 EQx_PTP; + u64 EQx_TP; + u64 EQx_SSBA; + u64 EQx_PSBA; + +/* 0x40 */ + u64 EQx_CEC; + u64 EQx_MEQL; + u64 EQx_XISBI; + u64 EQx_XISC; +/* 0x60 */ + u64 EQx_IT; + +}; +#define EQTEMM_OFFSET(x) offsetof(struct hipz_EQTEMM,x) + +#endif diff --git a/drivers/infiniband/hw/ehca/hipz_structs.h b/drivers/infiniband/hw/ehca/hipz_structs.h new file mode 100644 index 0000000..bd2dcad --- /dev/null +++ b/drivers/infiniband/hw/ehca/hipz_structs.h @@ -0,0 +1,54 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Infiniband Firmware structure definition + * + * Authors: Waleri Fomin + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: hipz_structs.h,v 1.8 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __HIPZ_STRUCTS_H__ +#define __HIPZ_STRUCTS_H__ + +/* access control defines for MR/MW */ +#define HIPZ_ACCESSCTRL_L_WRITE 0x00800000 +#define HIPZ_ACCESSCTRL_R_WRITE 0x00400000 +#define HIPZ_ACCESSCTRL_R_READ 0x00200000 +#define HIPZ_ACCESSCTRL_R_ATOMIC 0x00100000 +#define HIPZ_ACCESSCTRL_MW_BIND 0x00080000 + +#endif /* __IPZ_IF_H__ */ From rolandd at cisco.com Fri Feb 17 16:57:25 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:25 -0800 Subject: [openib-general] [PATCH 09/22] ehca classes In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005725.13620.32014.stgit@localhost.localdomain> From: Roland Dreier The fact that ehca_cq_delete and ehca_qp_delete return an int seems a little silly, given that the functions can never fail. The code in ehca_classes.c seems like a misuse of the kmem_cache API; rather than wrapping kmem_cache_alloc() and doing extra initialization, why not just use the kmem_cache's constructor to do this? 
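A rough sketch of the constructor idea raised above, written as a userspace model rather than kernel code. Everything in it (obj_cache, cache_alloc, cache_free, demo_cq and its fields) is a hypothetical stand-in and not part of the ehca driver or the slab API. It assumes the usual slab convention that the constructor runs when backing memory is set up, not on every allocation, so callers have to hand objects back in their constructed state. The delete helper returns void, since freeing cannot fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Per-object header kept outside the object so that the free list
 * never overwrites constructed state (hypothetical layout). */
union slot_hdr {
	union slot_hdr *next;
	max_align_t align;	/* keeps the object that follows aligned */
};

/* Hypothetical userspace stand-in for a slab cache with a constructor. */
struct obj_cache {
	size_t obj_size;
	void (*ctor)(void *obj);	/* run once per freshly backed object */
	union slot_hdr *free_list;	/* idle objects, still constructed */
	unsigned int ctor_calls;	/* only here to make the effect visible */
};

/* Stand-in for struct ehca_cq; ints replace spinlock_t (1 = initialised). */
struct demo_cq {
	int spinlock;
	int cb_lock;
	unsigned int token;
};

/* One-time initialisation that the patch does by hand in ehca_cq_new(). */
static void demo_cq_ctor(void *obj)
{
	struct demo_cq *cq = obj;

	memset(cq, 0, sizeof(*cq));
	cq->spinlock = 1;
	cq->cb_lock = 1;
}

static void *cache_alloc(struct obj_cache *c)
{
	union slot_hdr *h = c->free_list;

	if (h) {			/* reuse: no constructor call */
		c->free_list = h->next;
		return h + 1;
	}
	h = malloc(sizeof(*h) + c->obj_size);
	if (!h)
		return NULL;
	c->ctor(h + 1);			/* construct only fresh memory */
	c->ctor_calls++;
	return h + 1;
}

static void cache_free(struct obj_cache *c, void *obj)
{
	union slot_hdr *h = (union slot_hdr *)obj - 1;

	h->next = c->free_list;
	c->free_list = h;
}

/* With a constructor in place, new/delete shrink to thin wrappers. */
static struct demo_cq *demo_cq_new(struct obj_cache *c)
{
	return cache_alloc(c);
}

static void demo_cq_delete(struct obj_cache *c, struct demo_cq *cq)
{
	cq->token = 0;			/* restore constructed state before freeing */
	cache_free(c, cq);
}
```

In this model the second allocation reuses the first object without running the constructor again, which is the saving the constructor approach is after. Whether it fits the real driver depends on every delete path leaving the object in its constructed state.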
--- drivers/infiniband/hw/ehca/ehca_classes.c | 191 +++++++++++ drivers/infiniband/hw/ehca/ehca_classes.h | 369 +++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_classes_core.h | 73 ++++ drivers/infiniband/hw/ehca/ehca_classes_pSeries.h | 256 +++++++++++++++ 4 files changed, 889 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.c b/drivers/infiniband/hw/ehca/ehca_classes.c new file mode 100644 index 0000000..9819788 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes.c @@ -0,0 +1,191 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * struct initialisations and allocation + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_classes.c,v 1.21 2006/02/06 16:20:38 schickhj Exp $ + */ + +#define DEB_PREFIX "clas" +#include "ehca_kernel.h" + +#include "ehca_classes.h" + +struct ehca_pd *ehca_pd_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_pd *me; + + me = kmem_cache_alloc(ehca_module.cache_pd, SLAB_KERNEL); + if (me == NULL) + return NULL; + + memset(me, 0, sizeof(struct ehca_pd)); + + return me; +} + +void ehca_pd_delete(struct ehca_pd *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_pd, me); +} + +struct ehca_cq *ehca_cq_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_cq *me; + + me = kmem_cache_alloc(ehca_module.cache_cq, SLAB_KERNEL); + if (me == NULL) + return NULL; + + memset(me, 0, sizeof(struct ehca_cq)); + spin_lock_init(&me->spinlock); + spin_lock_init(&me->cb_lock); + + return me; +} + +int ehca_cq_delete(struct ehca_cq *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_cq, me); + + return H_Success; +} + +struct ehca_qp *ehca_qp_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_qp *me; + + me = kmem_cache_alloc(ehca_module.cache_qp, SLAB_KERNEL); + if (me == NULL) + return NULL; + + memset(me, 0, sizeof(struct ehca_qp)); + spin_lock_init(&me->spinlock_s); + spin_lock_init(&me->spinlock_r); + + return me; +} + +int ehca_qp_delete(struct ehca_qp *me) +{ + extern struct ehca_module ehca_module; + + 
kmem_cache_free(ehca_module.cache_qp, me); + + return H_Success; +} + +struct ehca_av *ehca_av_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_av *me; + + me = kmem_cache_alloc(ehca_module.cache_av, SLAB_KERNEL); + if (me == NULL) + return NULL; + + memset(me, 0, sizeof(struct ehca_av)); + + return me; +} + +int ehca_av_delete(struct ehca_av *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_av, me); + + return H_Success; +} + +struct ehca_mr *ehca_mr_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mr *me; + + me = kmem_cache_alloc(ehca_module.cache_mr, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mr)); + spin_lock_init(&me->mrlock); + EDEB_EX(7, "ehca_mr=%p sizeof(ehca_mr_t)=%x", me, + (u32) sizeof(struct ehca_mr)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +void ehca_mr_delete(struct ehca_mr *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mr, me); +} + +struct ehca_mw *ehca_mw_new(void) +{ + extern struct ehca_module ehca_module; + struct ehca_mw *me; + + me = kmem_cache_alloc(ehca_module.cache_mw, SLAB_KERNEL); + if (me) { + memset(me, 0, sizeof(struct ehca_mw)); + spin_lock_init(&me->mwlock); + EDEB_EX(7, "ehca_mw=%p sizeof(ehca_mw_t)=%x", me, + (u32) sizeof(struct ehca_mw)); + } else { + EDEB_ERR(3, "alloc failed"); + } + + return me; +} + +void ehca_mw_delete(struct ehca_mw *me) +{ + extern struct ehca_module ehca_module; + + kmem_cache_free(ehca_module.cache_mw, me); +} + diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h new file mode 100644 index 0000000..1d72aaf --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -0,0 +1,369 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * struct definitions for hcad internal structures + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All 
rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_classes.h,v 1.80 2006/02/06 16:20:38 schickhj Exp $ + */ + +#ifndef __EHCA_CLASSES_H__ +#define __EHCA_CLASSES_H__ + +#include "ehca_kernel.h" +#include "ipz_pt_fn.h" + +#include + +struct ehca_module; +struct ehca_qp; +struct ehca_cq; +struct ehca_eq; +struct ehca_mr; +struct ehca_mw; +struct ehca_pd; +struct ehca_av; + +#ifndef CONFIG_PPC64 +#ifndef Z_SERIES +#error "no series defined" +#endif +#endif + +#ifdef CONFIG_PPC64 +#include "ehca_classes_pSeries.h" +#endif + +#ifdef Z_SERIES +#include "ehca_classes_zSeries.h" +#endif + +#include +#include + +#include "ehca_irq.h" + +#include "ehca_classes_core.h" + +/** @brief HCAD class + * + * contains HCAD specific data + * + */ +struct ehca_module { + struct list_head shca_list; + spinlock_t shca_lock; + + kmem_cache_t *cache_pd; + kmem_cache_t *cache_cq; + kmem_cache_t *cache_qp; + kmem_cache_t *cache_av; + kmem_cache_t *cache_mr; + kmem_cache_t *cache_mw; + + struct ehca_pfmodule pf; /* plattform specific part of HCA */ +}; + +/** @brief EQ class + */ +struct ehca_eq { + u32 length; /* length of EQ */ + struct ipz_queue ipz_queue; /* EQ in kv */ + struct ipz_eq_handle ipz_eq_handle; + struct ehca_irq_info irq_info; + struct work_struct work; + struct h_galpas galpas; + int is_initialized; + + struct ehca_pfeq pf; /* plattform specific part of EQ */ + + spinlock_t spinlock; +}; + +/** static port + */ +struct ehca_sport { + struct ib_cq *ibcq_aqp1; /* CQ for AQP1 */ + struct ib_qp *ibqp_aqp1; /* QP for AQP1 */ + enum ib_port_state port_state; +}; + +/** @brief HCA class "static HCA" + * + * contains HCA specific data per HCA (or vHCA?) 
+ * per instance reported by firmware + * + */ +struct ehca_shca { + struct ib_device ib_device; + struct ibmebus_dev *ibmebus_dev; + u8 num_ports; + int hw_level; + struct list_head shca_list; + struct ipz_adapter_handle ipz_hca_handle; /* firmware HCA handle */ + struct ehca_bridge_handle bridge; + struct ehca_sport sport[2]; + struct ehca_eq eq; /* event queue */ + struct ehca_eq neq; /* notification event queue */ + struct ehca_mr *maxmr; /* internal max MR (for kernel users) */ + struct ehca_pd *pd; /* internal pd (for kernel users) */ + struct ehca_pfshca pf; /* plattform specific part of HCA */ + struct h_galpas galpas; +}; + +/** @brief protection domain + */ +struct ehca_pd { + struct ib_pd ib_pd; /* gen2 qp, must always be first in ehca_pd */ + struct ipz_pd fw_pd; + struct ehca_pfpd pf; +}; + +/** @brief QP class + */ +struct ehca_qp { + struct ib_qp ib_qp; /* gen2 qp, must always be first in ehca_qp */ + struct ehca_qp_core ehca_qp_core; /* common fields for + user/kernel space */ + u32 token; + spinlock_t spinlock_s; + spinlock_t spinlock_r; + u32 sq_max_inline_data_size; /* max # of inline data can be send */ + struct ipz_qp_handle ipz_qp_handle; /* QP handle for h-calls */ + struct ehca_pfqp pf; /* plattform specific part of QP */ + struct ib_qp_init_attr init_attr; + /* adr mapping for s/r queues and fw handle bw kernel&user space */ + u64 uspace_squeue; + u64 uspace_rqueue; + u64 uspace_fwh; + struct ehca_cq* send_cq; + unsigned int sqerr_purgeflag; + struct list_head list_entries; +}; + +#define QP_HASHTAB_LEN 7 +/** @brief CQ class + */ +struct ehca_cq { + struct ib_cq ib_cq; /* gen2 cq, must always be first + in ehca_cq */ + struct ehca_cq_core ehca_cq_core; /* common fields for + user/kernel space */ + spinlock_t spinlock; + u32 cq_number; + u32 token; + u32 nr_of_entries; + /* fw specific data common for p+z */ + struct ipz_cq_handle ipz_cq_handle; /* CQ handle for h-calls */ + /* pf specific code */ + struct ehca_pfcq pf; /* platform specific 
part of CQ */ + spinlock_t cb_lock; /* completion event handler */ + /* adr mapping for queue and fw handle bw kernel&user space */ + u64 uspace_queue; + u64 uspace_fwh; + struct list_head qp_hashtab[QP_HASHTAB_LEN]; +}; + + +/** @brief MR flags + */ +enum ehca_mr_flag { + EHCA_MR_FLAG_FMR = 0x80000000, /* FMR, created with ehca_alloc_fmr */ + EHCA_MR_FLAG_MAXMR = 0x40000000, /* max-MR */ + EHCA_MR_FLAG_USER = 0x20000000 /* user space TODO...necessary????. */ +}; + +/** @brief MR class + */ +struct ehca_mr { + union { + struct ib_mr ib_mr; /* must always be first in ehca_mr */ + struct ib_fmr ib_fmr; /* must always be first in ehca_mr */ + } ib; + + spinlock_t mrlock; + + /* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! + * !!! ehca_mr_deletenew() memsets from flags to end of structure + * !!! DON'T move flags or insert another field before. + * !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! */ + + enum ehca_mr_flag flags; + u32 num_pages; /* number of MR pages */ + int acl; /* ACL (stored here for usage in reregister) */ + u64 *start; /* virtual start address (stored here for */ + /* usage in reregister) */ + u64 size; /* size (stored here for usage in reregister) */ + u32 fmr_page_size; /* page size for FMR */ + u32 fmr_max_pages; /* max pages for FMR */ + u32 fmr_max_maps; /* max outstanding maps for FMR */ + u32 fmr_map_cnt; /* map counter for FMR */ + /* fw specific data */ + struct ipz_mrmw_handle ipz_mr_handle; /* MR handle for h-calls */ + struct h_galpas galpas; + /* data for userspace bridge */ + u32 nr_of_pages; + void *pagearray; + + struct ehca_pfmr pf; /* platform specific part of MR */ +}; + +/** @brief MW class + */ +struct ehca_mw { + struct ib_mw ib_mw; /* gen2 mw, must always be first in ehca_mw */ + spinlock_t mwlock; + + u8 never_bound; /* indication MW was never bound */ + struct ipz_mrmw_handle ipz_mw_handle; /* MW handle for h-calls */ + struct h_galpas galpas; + + struct ehca_pfmw pf; /* platform 
specific part of MW */ +}; + +/** @brief MR page info type + */ +enum ehca_mr_pgi_type { + EHCA_MR_PGI_PHYS = 1, /* type of ehca_reg_phys_mr, + * ehca_rereg_phys_mr, + * ehca_reg_internal_maxmr */ + EHCA_MR_PGI_USER = 2, /* type of ehca_reg_user_mr */ + EHCA_MR_PGI_FMR = 3 /* type of ehca_map_phys_fmr */ +}; + +/** @brief MR page info + */ +struct ehca_mr_pginfo { + enum ehca_mr_pgi_type type; + u64 num_pages; + u64 page_count; + + /* type EHCA_MR_PGI_PHYS section */ + int num_phys_buf; + struct ib_phys_buf *phys_buf_array; + u64 next_buf; + u64 next_page; + + /* type EHCA_MR_PGI_USER section */ + struct ib_umem *region; + struct ib_umem_chunk *next_chunk; + u64 next_nmap; + + /* type EHCA_MR_PGI_FMR section */ + u64 *page_list; + u64 next_listelem; +}; + + +/** @brief addres vector suitable for a ud enqueue request + */ +struct ehca_av { + struct ib_ah ib_ah; /* gen2 ah, must always be first in ehca_ah */ + struct ehca_ud_av av; +}; + +/** @brief user context + */ +struct ehca_ucontext { + struct ib_ucontext ib_ucontext; +}; + +struct ehca_module *ehca_module_new(void); + +int ehca_module_delete(struct ehca_module *me); + +int ehca_eq_ctor(struct ehca_eq *eq); + +int ehca_eq_dtor(struct ehca_eq *eq); + +struct ehca_shca *ehca_shca_new(void); + +int ehca_shca_delete(struct ehca_shca *me); + +struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); /*anchor?? 
*/ + +struct ehca_cq *ehca_cq_new(void); + +int ehca_cq_delete(struct ehca_cq *me); + +struct ehca_av *ehca_av_new(void); + +int ehca_av_delete(struct ehca_av *me); + +struct ehca_pd *ehca_pd_new(void); + +void ehca_pd_delete(struct ehca_pd *me); + +struct ehca_qp *ehca_qp_new(void); + +int ehca_qp_delete(struct ehca_qp *me); + +struct ehca_mr *ehca_mr_new(void); + +void ehca_mr_delete(struct ehca_mr *me); + +struct ehca_mw *ehca_mw_new(void); + +void ehca_mw_delete(struct ehca_mw *me); + +extern struct rw_semaphore ehca_qp_idr_sem; +extern struct rw_semaphore ehca_cq_idr_sem; +extern struct idr ehca_qp_idr; +extern struct idr ehca_cq_idr; + +/* + * resp structs for comm bw user and kernel space + */ +struct ehca_create_cq_resp { + u32 cq_number; + u32 token; + struct ehca_cq_core ehca_cq_core; +}; + +struct ehca_create_qp_resp { + u32 qp_num; + u32 token; + struct ehca_qp_core ehca_qp_core; +}; + +/* + * helper funcs to link send cq and qp + */ +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp); +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int qp_num); +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int qp_num); + +#endif /* __EHCA_CLASSES_H__ */ diff --git a/drivers/infiniband/hw/ehca/ehca_classes_core.h b/drivers/infiniband/hw/ehca/ehca_classes_core.h new file mode 100644 index 0000000..5e864b3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes_core.h @@ -0,0 +1,73 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * core struct definitions for hcad internal structures and + * to be used/compiled commonly in user and kernel space + * + * Authors: Christoph Raisch + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_classes_core.h,v 1.12 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_CLASSES_CORE_H__ +#define __EHCA_CLASSES_CORE_H__ + +#include "ipz_pt_fn_core.h" +#include "ehca_galpa.h" + +/** @brief qp core contains common fields for user/kernel space + */ +struct ehca_qp_core { + /* kernel space: enum ib_qp_type, user space: enum ibv_qp_type */ + int qp_type; + int dummy1; /* 8 byte alignment */ + struct ipz_queue ipz_squeue; + struct ipz_queue ipz_rqueue; + struct h_galpas galpas; + unsigned int qkey; + int dummy2; /* 8 byte alignment */ + /* qp_num assigned by ehca: sqp0/1 may have got different numbers */ + unsigned int real_qp_num; +}; + +/** @brief cq core contains common fields for user/kernel space + */ +struct ehca_cq_core { + struct ipz_queue ipz_queue; + struct h_galpas galpas; +}; + +#endif /* __EHCA_CLASSES_CORE_H__ */ diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h new file mode 100644 index 0000000..8f86137 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h @@ -0,0 +1,256 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * pSeries interface definitions + * + * Authors: Waleri Fomin + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_classes_pSeries.h,v 1.24 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_CLASSES_PSERIES_H__ +#define __EHCA_CLASSES_PSERIES_H__ + +#include "ehca_galpa.h" +#include "ipz_pt_fn.h" + + +struct ehca_pfmodule { +}; + +struct ehca_pfshca { +}; + +struct ehca_pfqp { + struct ipz_qpt sqpt; + struct ipz_qpt rqpt; + struct ehca_bridge_handle bridge; +}; + +struct ehca_pfcq { + struct ipz_qpt qpt; + struct ehca_bridge_handle bridge; + u32 cqnr; +}; + +struct ehca_pfeq { + struct ipz_qpt qpt; + struct ehca_bridge_handle bridge; + struct h_galpa galpa; + u32 eqnr; +}; + +struct ehca_pfpd { +}; + +struct ehca_pfmr { + struct ehca_bridge_handle bridge; +}; +struct ehca_pfmw { +}; + +struct ipz_adapter_handle { + u64 handle; +}; + +struct ipz_cq_handle { + u64 handle; +}; + +struct ipz_eq_handle { + u64 handle; +}; + +struct ipz_qp_handle { + u64 handle; +}; +struct ipz_mrmw_handle { + u64 handle; +}; + +struct ipz_pd { + u32 value; +}; + +struct hcp_modify_qp_control_block { + u32 qkey; /* 00 */ + u32 rdd; /* reliable datagram domain */ + u32 send_psn; /* 02 */ + u32 receive_psn; /* 03 */ + u32 prim_phys_port; /* 04 */ + u32 alt_phys_port; /* 05 */ + u32 
prim_p_key_idx; /* 06 */ + u32 alt_p_key_idx; /* 07 */ + u32 rdma_atomic_ctrl; /* 08 */ + u32 qp_state; /* 09 */ + u32 reserved_10; /* 10 */ + u32 rdma_nr_atomic_resp_res; /* 11 */ + u32 path_migration_state; /* 12 */ + u32 rdma_atomic_outst_dest_qp; /* 13 */ + u32 dest_qp_nr; /* 14 */ + u32 min_rnr_nak_timer_field; /* 15 */ + u32 service_level; /* 16 */ + u32 send_grh_flag; /* 17 */ + u32 retry_count; /* 18 */ + u32 timeout; /* 19 */ + u32 path_mtu; /* 20 */ + u32 max_static_rate; /* 21 */ + u32 dlid; /* 22 */ + u32 rnr_retry_count; /* 23 */ + u32 source_path_bits; /* 24 */ + u32 traffic_class; /* 25 */ + u32 hop_limit; /* 26 */ + u32 source_gid_idx; /* 27 */ + u32 flow_label; /* 28 */ + u32 reserved_29; /* 29 */ + union { /* 30 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid; + u32 service_level_al; /* 34 */ + u32 send_grh_flag_al; /* 35 */ + u32 retry_count_al; /* 36 */ + u32 timeout_al; /* 37 */ + u32 max_static_rate_al; /* 38 */ + u32 dlid_al; /* 39 */ + u32 rnr_retry_count_al; /* 40 */ + u32 source_path_bits_al; /* 41 */ + u32 traffic_class_al; /* 42 */ + u32 hop_limit_al; /* 43 */ + u32 source_gid_idx_al; /* 44 */ + u32 flow_label_al; /* 45 */ + u32 reserved_46; /* 46 */ + u32 reserved_47; /* 47 */ + union { /* 48 */ + u64 dw[2]; + u8 byte[16]; + } dest_gid_al; + u32 max_nr_outst_send_wr; /* 52 */ + u32 max_nr_outst_recv_wr; /* 53 */ + u32 disable_ete_credit_check; /* 54 */ + u32 qp_number; /* 55 */ + u64 send_queue_handle; /* 56 */ + u64 recv_queue_handle; /* 58 */ + u32 actual_nr_sges_in_sq_wqe; /* 60 */ + u32 actual_nr_sges_in_rq_wqe; /* 61 */ + u32 qp_enable; /* 62 */ + u32 curr_srq_limit; /* 63 */ + u64 qp_aff_asyn_ev_log_reg; /* 64 */ + u64 shared_rq_hndl; /* 66 */ + u64 trigg_doorbell_qp_hndl; /* 68 */ + u32 reserved_70_127[58]; /* 70 */ +}; + +#define MQPCB_MASK_QKEY EHCA_BMASK_IBM(0,0) +#define MQPCB_MASK_SEND_PSN EHCA_BMASK_IBM(2,2) +#define MQPCB_MASK_RECEIVE_PSN EHCA_BMASK_IBM(3,3) +#define MQPCB_MASK_PRIM_PHYS_PORT EHCA_BMASK_IBM(4,4) +#define 
MQPCB_PRIM_PHYS_PORT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_PHYS_PORT EHCA_BMASK_IBM(5,5) +#define MQPCB_MASK_PRIM_P_KEY_IDX EHCA_BMASK_IBM(6,6) +#define MQPCB_PRIM_P_KEY_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_ALT_P_KEY_IDX EHCA_BMASK_IBM(7,7) +#define MQPCB_MASK_RDMA_ATOMIC_CTRL EHCA_BMASK_IBM(8,8) +#define MQPCB_MASK_QP_STATE EHCA_BMASK_IBM(9,9) +#define MQPCB_QP_STATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES EHCA_BMASK_IBM(11,11) +#define MQPCB_MASK_PATH_MIGRATION_STATE EHCA_BMASK_IBM(12,12) +#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP EHCA_BMASK_IBM(13,13) +#define MQPCB_MASK_DEST_QP_NR EHCA_BMASK_IBM(14,14) +#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD EHCA_BMASK_IBM(15,15) +#define MQPCB_MASK_SERVICE_LEVEL EHCA_BMASK_IBM(16,16) +#define MQPCB_MASK_SEND_GRH_FLAG EHCA_BMASK_IBM(17,17) +#define MQPCB_MASK_RETRY_COUNT EHCA_BMASK_IBM(18,18) +#define MQPCB_MASK_TIMEOUT EHCA_BMASK_IBM(19,19) +#define MQPCB_MASK_PATH_MTU EHCA_BMASK_IBM(20,20) +#define MQPCB_PATH_MTU EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_MAX_STATIC_RATE EHCA_BMASK_IBM(21,21) +#define MQPCB_MAX_STATIC_RATE EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID EHCA_BMASK_IBM(22,22) +#define MQPCB_DLID EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT EHCA_BMASK_IBM(23,23) +#define MQPCB_RNR_RETRY_COUNT EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS EHCA_BMASK_IBM(24,24) +#define MQPCB_SOURCE_PATH_BITS EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS EHCA_BMASK_IBM(25,25) +#define MQPCB_TRAFFIC_CLASS EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_HOP_LIMIT EHCA_BMASK_IBM(26,26) +#define MQPCB_HOP_LIMIT EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX EHCA_BMASK_IBM(27,27) +#define MQPCB_SOURCE_GID_IDX EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL EHCA_BMASK_IBM(28,28) +#define MQPCB_FLOW_LABEL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID EHCA_BMASK_IBM(30,30) +#define MQPCB_MASK_SERVICE_LEVEL_AL EHCA_BMASK_IBM(31,31) +#define 
MQPCB_SERVICE_LEVEL_AL EHCA_BMASK_IBM(28,31) +#define MQPCB_MASK_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(32,32) +#define MQPCB_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_RETRY_COUNT_AL EHCA_BMASK_IBM(33,33) +#define MQPCB_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_TIMEOUT_AL EHCA_BMASK_IBM(34,34) +#define MQPCB_TIMEOUT_AL EHCA_BMASK_IBM(27,31) +#define MQPCB_MASK_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(35,35) +#define MQPCB_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_DLID_AL EHCA_BMASK_IBM(36,36) +#define MQPCB_DLID_AL EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(37,37) +#define MQPCB_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) +#define MQPCB_MASK_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(38,38) +#define MQPCB_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(25,31) +#define MQPCB_MASK_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(39,39) +#define MQPCB_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_HOP_LIMIT_AL EHCA_BMASK_IBM(40,40) +#define MQPCB_HOP_LIMIT_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(41,41) +#define MQPCB_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(24,31) +#define MQPCB_MASK_FLOW_LABEL_AL EHCA_BMASK_IBM(42,42) +#define MQPCB_FLOW_LABEL_AL EHCA_BMASK_IBM(12,31) +#define MQPCB_MASK_DEST_GID_AL EHCA_BMASK_IBM(44,44) +#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(45,45) +#define MQPCB_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(46,46) +#define MQPCB_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) +#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(47,47) +#define MQPCB_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(31,31) +#define MQPCB_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48,48) +#define MQPCB_QP_ENABLE EHCA_BMASK_IBM(31,31) +#define MQPCB_MASK_CURR_SQR_LIMIT EHCA_BMASK_IBM(49,49) +#define MQPCB_CURR_SQR_LIMIT EHCA_BMASK_IBM(15,31) +#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50,50) 
+#define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51,51) + +#endif /* __EHCA_CLASSES_PSERIES_H__ */ From rolandd at cisco.com Fri Feb 17 16:57:27 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:27 -0800 Subject: [openib-general] [PATCH 10/22] ehca IRQ handling In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005727.13620.58832.stgit@localhost.localdomain> From: Roland Dreier Where is the irq_count field of struct ehca_irq_info ever used? I couldn't find anywhere, so it can be deleted. The logic in ehca_interrupt_eq() is too convoluted for me to follow; there are two nested while () {} loops inside a do {} while () loop, and ehca_poll_eq() is called in three different places. Is there any way to untangle this? --- drivers/infiniband/hw/ehca/ehca_irq.c | 436 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_irq.h | 90 +++++++ 2 files changed, 526 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c new file mode 100644 index 0000000..1bba58e --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -0,0 +1,436 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Functions for EQs, NEQs and interrupts + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_irq.c,v 1.64 2006/02/15 08:15:25 schickhj Exp $ + */ + +#include "ehca_kernel.h" +#include "ehca_irq.h" + +#define DEB_PREFIX "eirq" + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "ehca_eq.h" +#include "ehca_irq.h" +#include "hcp_if.h" + +#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_EE_IDENTIFIER EHCA_BMASK_IBM(2,7) +#define EQE_CQ_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_NUMBER EHCA_BMASK_IBM(8,31) +#define EQE_QP_TOKEN EHCA_BMASK_IBM(32,63) +#define EQE_CQ_TOKEN EHCA_BMASK_IBM(32,63) + +#define NEQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) +#define NEQE_EVENT_CODE EHCA_BMASK_IBM(2,7) +#define NEQE_PORT_NUMBER EHCA_BMASK_IBM(8,15) +#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16) + +#define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52,63) + +static inline void comp_event_callback(struct ehca_cq *cq) +{ + unsigned long spl_flags = 0; + + EDEB_EN(7, "cq=%p", cq); + + if (cq->ib_cq.comp_handler == NULL) + return; + + spin_lock_irqsave(&cq->cb_lock, spl_flags); + cq->ib_cq.comp_handler(&cq->ib_cq, cq->ib_cq.cq_context); + spin_unlock_irqrestore(&cq->cb_lock, spl_flags); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +int ehca_error_data(struct ehca_shca *shca, + u64 ressource) +{ + + unsigned long ret = 0; + u64 *rblock; + unsigned long block_count; + + EDEB_EN(7, "ressource=%lx", ressource); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto error_data1; + } + + memset(rblock, 0, PAGE_SIZE); + + ret = hipz_h_error_data(shca->ipz_hca_handle, + ressource, + rblock, + &block_count); + + if (ret == H_R_STATE) { + EDEB_ERR(4, "No error data is available: %lx.", ressource); + } + else if (ret == H_Success) { + int length; + + length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); + + if (length > PAGE_SIZE) + length = PAGE_SIZE; + + EDEB_ERR(4, "Error data is 
available: %lx.", ressource); + EDEB_ERR(4, "EHCA ----- error data begin " + "---------------------------------------------------"); + EDEB_DMP(4, rblock, length, "ressource=%lx", ressource); + EDEB_ERR(4, "EHCA ----- error data end " + "-----------------------------------------------------"); + } + else { + EDEB_ERR(4, "Error data could not be fetched: %lx", ressource); + } + + kfree(rblock); + + error_data1: + return ret; + +} + +static void qp_event_callback(struct ehca_shca *shca, + u64 eqe, + enum ib_event_type event_type) +{ + struct ib_event event; + struct ehca_qp *qp; + u32 token = EHCA_BMASK_GET(EQE_QP_TOKEN, eqe); + + EDEB_EN(7, "eqe=%lx", eqe); + + down_read(&ehca_qp_idr_sem); + qp = idr_find(&ehca_qp_idr, token); + up_read(&ehca_qp_idr_sem); + + if (qp == NULL) + return; + + if (event_type == IB_EVENT_QP_FATAL) + EDEB_ERR(4, "QP 0x%x (ressource=%lx) has errors.", + qp->ib_qp.qp_num, qp->ipz_qp_handle.handle); + + ehca_error_data(shca, qp->ipz_qp_handle.handle); + + if (qp->ib_qp.event_handler == NULL) + return; + + event.device = &shca->ib_device; + event.event = event_type; + event.element.qp = &qp->ib_qp; + + qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context); + + EDEB_EX(7, "qp=%p", qp); + + return; +} + +static void cq_event_callback(struct ehca_shca *shca, + u64 eqe) +{ + struct ehca_cq *cq; + u32 token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe); + + EDEB_EN(7, "eqe=%lx", eqe); + + down_read(&ehca_cq_idr_sem); + cq = idr_find(&ehca_cq_idr, token); + up_read(&ehca_cq_idr_sem); + + if (cq == NULL) + return; + + EDEB_ERR(4, "CQ 0x%x (ressource=%lx) has errors.", + cq->cq_number, cq->ipz_cq_handle.handle); + + ehca_error_data(shca, cq->ipz_cq_handle.handle); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +static void parse_identifier(struct ehca_shca *shca, u64 eqe) +{ + u8 identifier = EHCA_BMASK_GET(EQE_EE_IDENTIFIER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (identifier) { + case 0x02: /* path migrated */ + qp_event_callback(shca, 
eqe, IB_EVENT_PATH_MIG); + break; + case 0x03: /* communication established */ + qp_event_callback(shca, eqe, IB_EVENT_COMM_EST); + break; + case 0x04: /* send queue drained */ + qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED); + break; + case 0x05: /* QP error */ + case 0x06: /* QP error */ + qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL); + break; + case 0x07: /* CQ error */ + case 0x08: /* CQ error */ + cq_event_callback(shca, eqe); + break; + case 0x09: /* MRMWPTE error */ + case 0x0A: /* port event */ + case 0x0B: /* MR access error */ + case 0x0C: /* EQ error */ + case 0x0D: /* P/Q_Key mismatch */ + case 0x10: /* sampling complete */ + case 0x11: /* unaffiliated access error */ + case 0x12: /* path migrating error */ + case 0x13: /* interface trace stopped */ + case 0x14: /* first error capture info available */ + default: + EDEB_ERR(4, "Unknown identifier: %x on %s.", + identifier, shca->ib_device.name); + break; + } + + EDEB_EN(7, "eqe=%lx identifier=%x", eqe, identifier); + + return; +} + +static void parse_ec(struct ehca_shca *shca, u64 eqe) +{ + struct ib_event event; + u8 ec = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe); + u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe); + + EDEB_EN(7, "shca=%p eqe=%lx", shca, eqe); + + switch (ec) { + case 0x30: /* port availability change */ + if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) { + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + } else { + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + break; + case 0x31: + /* port configuration change */ + /* disruptive change is caused by */ + /* LID, PKEY or SM 
change */ + EDEB(4, "EHCA disruptive port %x " + "configuration change.", port); + + EDEB(4, "%s: port %x is inactive.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + + EDEB(4, "%s: port %x is active.", + shca->ib_device.name, port); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ACTIVE; + event.element.port_num = port; + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + ib_dispatch_event(&event); + break; + case 0x32: /* adapter malfunction */ + case 0x33: /* trace stopped */ + default: + EDEB_ERR(4, "Unknown event code: %x on %s.", + ec, shca->ib_device.name); + break; + } + + EDEB_EN(7, "eqe=%lx ec=%x", eqe, ec); + + return; +} + +static inline void reset_eq_pending(struct ehca_cq *cq) +{ + u64 CQx_EP = 0; + struct h_galpa gal = cq->ehca_cq_core.galpas.kernel; + + EDEB_EN(7, "cq=%p", cq); + + hipz_galpa_store_cq(gal, CQx_EP, 0x0); + CQx_EP = hipz_galpa_load(gal, CQTEMM_OFFSET(CQx_EP)); + EDEB(7, "CQx_EP=%lx", CQx_EP); + + EDEB_EX(7, "cq=%p", cq); + + return; +} + +void ehca_interrupt_eq(void *data) +{ + struct ehca_irq_info *irq_info; + struct ehca_shca *shca; + struct ehca_eqe *eqe; + int int_state; + + EDEB_EN(7, "data=%p", data); + + irq_info = (struct ehca_irq_info *)data; + shca = to_shca(eq); + + do { + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + if ((shca->hw_level >= 2) && (eqe != NULL)) + int_state = 1; + else + int_state = 0; + + while ((int_state == 1) || (eqe != 0)) { + while (eqe) { + u64 eqe_value = eqe->entry; + + EDEB(7, "eqe_value=%lx", eqe_value); + + /* TODO: better structure */ + if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, + eqe_value)) { + extern struct idr ehca_cq_idr; + u32 token; + struct ehca_cq *cq; + + EDEB(7, "... 
completion event"); + token = + EHCA_BMASK_GET(EQE_CQ_TOKEN, + eqe_value); + down_read(&ehca_cq_idr_sem); + cq = idr_find(&ehca_cq_idr, token); + up_read(&ehca_cq_idr_sem); + reset_eq_pending(cq); + comp_event_callback(cq); + } else { + EDEB(7, "... non completion event"); + parse_identifier(shca, eqe_value); + } + eqe = + (struct ehca_eqe *)ehca_poll_eq(shca, + &shca->eq); + } + + /* TODO: do we need hw_level */ + if (shca->hw_level >= 2) + int_state = + hipz_h_query_int_state(shca->ipz_hca_handle, + irq_info); + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + + } + } while (int_state != 0); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +void ehca_interrupt_neq(void *data) +{ + struct ehca_irq_info *irq_info; + struct ehca_shca *shca; + struct ehca_eqe *eqe; + u64 ret = H_Success; + + EDEB_EN(7, "data=%p", data); + + irq_info = (struct ehca_irq_info *)data; + shca = to_shca(neq); + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + + while (eqe) { + if (!EHCA_BMASK_GET(NEQE_COMPLETION_EVENT, eqe->entry)) + parse_ec(shca, eqe->entry); + + eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + } + + ret = hipz_h_reset_event(shca->ipz_hca_handle, + shca->neq.ipz_eq_handle, 0xFFFFFFFFFFFFFFFF); + + if (ret != H_Success) + EDEB_ERR(4, "Can't clear notification events."); + + EDEB_EX(7, "shca=%p", shca); + + return; +} + +irqreturn_t ehca_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + struct ehca_irq_info *info = (struct ehca_irq_info *)dev_id; + + EDEB_EN(7, "dev_id=%p", dev_id); + + queue_work(info->wq, info->work); + + EDEB_EX(7, ""); + + return IRQ_HANDLED; +} diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h new file mode 100644 index 0000000..43b2e3e --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_irq.h @@ -0,0 +1,90 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions and structs for EQs, NEQs and interrupts + * + * Authors: Heiko J 
Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_irq.h,v 1.25 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_IRQ_H +#define __EHCA_IRQ_H + + +struct ehca_shca; + +#include +#include + +#ifndef EHCA_USERDRIVER +#include +#endif + +#ifndef __KERNEL__ +#define NO_IRQ (-1) +#include +#include +#endif + +#ifndef EHCA_USERDRIVER +#define to_shca(queue) container_of(irq_info->eq, \ + struct ehca_shca, \ + queue) +#else +extern struct ehca_module ehca_module; +#define to_shca(queue) list_entry(ehca_module.shca_list.next, \ + struct ehca_shca, shca_list) +#endif + +struct ehca_irq_info { + __u32 ist; + __u32 irq; + void *eq; + + atomic_t irq_count; + struct workqueue_struct *wq; + struct work_struct *work; + + pid_t pid; +}; + +void ehca_interrupt_eq(void *data); +void ehca_interrupt_neq(void *data); +irqreturn_t ehca_interrupt(int irq, void *dev_id, struct pt_regs *regs); +irqreturn_t ehca_interrupt_direct(int irq, void *dev_id, struct pt_regs *regs); +int ehca_error_data(struct ehca_shca *shca, u64 ressource); + +#endif From rolandd at cisco.com Fri Feb 17 16:57:23 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:23 -0800 Subject: [openib-general] [PATCH 08/22] Generic ehca headers In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005723.13620.10389.stgit@localhost.localdomain> From: Roland Dreier The defines of TRUE and FALSE look rather useless. Why are they needed? What is struct ehca_cache for? It doesn't seem to be used anywhere. ehca_kv_to_g() looks completely horrible. The whole idea of using vmalloc()ed kernel memory to do DMA seems unacceptable to me. It's usual to include all headers before all headers. 
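A note for readers following the EHCA_BMASK_IBM() masks that ehca_tools.h defines and the other ehca patches use: the "IBM way" counts bit 0 as the most significant bit of a 64-bit word, and a mask packs the shift position and the field length into one integer. Below is a minimal userspace sketch of that encoding, not the driver's code. The SET/GET bodies are an assumed reading of the macros (mask then shift up, shift down then mask), uint64_t stands in for the kernel's u64, and eqe_pack() is a hypothetical helper that does not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* A mask packs two numbers: (shift position << 16) + field length in bits.
 * IBM numbering counts bit 0 as the most significant bit of a 64-bit word. */
#define EHCA_BMASK_IBM(from, to)   (((63 - (to)) << 16) + ((to) - (from) + 1))
#define EHCA_BMASK_SHIFTPOS(mask)  (((mask) >> 16) & 0xffff)
#define EHCA_BMASK_MASK(mask) \
	(0xffffffffffffffffULL >> ((64 - (mask)) & 0xffff))

/* Assumed bodies: SET masks the value and shifts it up into place,
 * GET shifts the word down and masks off the field. */
#define EHCA_BMASK_SET(mask, value) \
	((EHCA_BMASK_MASK(mask) & ((uint64_t)(value))) << EHCA_BMASK_SHIFTPOS(mask))
#define EHCA_BMASK_GET(mask, value) \
	(EHCA_BMASK_MASK(mask) & (((uint64_t)(value)) >> EHCA_BMASK_SHIFTPOS(mask)))

/* Field layout as defined in ehca_irq.c (patch 10/22). */
#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1, 1)
#define EQE_EE_IDENTIFIER    EHCA_BMASK_IBM(2, 7)
#define EQE_CQ_TOKEN         EHCA_BMASK_IBM(32, 63)

/* Hypothetical helper (not in the patch): build a 64-bit EQE value. */
uint64_t eqe_pack(unsigned completion, unsigned identifier, uint32_t token)
{
	return EHCA_BMASK_SET(EQE_COMPLETION_EVENT, completion) |
	       EHCA_BMASK_SET(EQE_EE_IDENTIFIER, identifier) |
	       EHCA_BMASK_SET(EQE_CQ_TOKEN, token);
}
```

Under these assumptions eqe_pack(1, 0x05, 0x1234) gives 0x4500000000001234: IBM bit 1 is conventional bit 62, the identifier occupies conventional bits 56..61, and the token sits in the low 32 bits. EHCA_BMASK_GET(EQE_EE_IDENTIFIER, ...) on that value returns 0x05 again.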
--- drivers/infiniband/hw/ehca/ehca_flightrecorder.h | 74 ++++ drivers/infiniband/hw/ehca/ehca_kernel.h | 135 +++++++ drivers/infiniband/hw/ehca/ehca_tools.h | 431 ++++++++++++++++++++++ 3 files changed, 640 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_flightrecorder.h b/drivers/infiniband/hw/ehca/ehca_flightrecorder.h new file mode 100644 index 0000000..7c631ad --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_flightrecorder.h @@ -0,0 +1,74 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * flightrecorder macros + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_flightrecorder.h,v 1.5 2006/02/06 10:17:34 schickhj Exp $ + */ +/*****************************************************************************/ +#ifndef EHCA_FLIGHTRECORDER_H +#define EHCA_FLIGHTRECORDER_H + +#define ED_EXTEND1(x,ar1...) \ + unsigned long __EDEB_R2=(const unsigned long)x-0;ED_EXTEND2(ar1) +#define ED_EXTEND2(x,ar1...) \ + unsigned long __EDEB_R3=(const unsigned long)x-0;ED_EXTEND3(ar1) +#define ED_EXTEND3(x,ar1...) \ + unsigned long __EDEB_R4=(const unsigned long)x-0;ED_EXTEND4(ar1) +#define ED_EXTEND4(x,ar1...) + +#define EHCA_FLIGHTRECORDER_SIZE 65536 + +extern atomic_t ehca_flightrecorder_index; +extern unsigned long ehca_flightrecorder[EHCA_FLIGHTRECORDER_SIZE]; + +/* Not nice, but -O2 optimized */ + +#define ED_FLIGHT_LOG(x,ar1...) 
{ \ + u32 flight_offset = ((u32) \ + atomic_add_return(4, &ehca_flightrecorder_index)) \ + % EHCA_FLIGHTRECORDER_SIZE; \ + unsigned long *flight_trline = &ehca_flightrecorder[flight_offset]; \ + unsigned long __EDEB_R1 = (unsigned long) x-0; ED_EXTEND1(ar1) \ + flight_trline[0]=__EDEB_R1,flight_trline[1]=__EDEB_R2, \ + flight_trline[2]=__EDEB_R3,flight_trline[3]=__EDEB_R4; } + +#define EHCA_FLIGHTRECORDER_BACKLOG 60 + +void ehca_flight_to_printk(void); + +#endif diff --git a/drivers/infiniband/hw/ehca/ehca_kernel.h b/drivers/infiniband/hw/ehca/ehca_kernel.h new file mode 100644 index 0000000..f119149 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_kernel.h @@ -0,0 +1,135 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * generalized functions for code shared between kernel and userspace + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_kernel.h,v 1.39 2006/02/06 11:45:10 schickhj Exp $ + */ + +#ifndef _EHCA_KERNEL_H_ +#define _EHCA_KERNEL_H_ + +#define FALSE (1==0) +#define TRUE (1==1) + +#define big_little_target 0 /* needed for simulation */ +#include + +#include +#include "ehca_common.h" +#include "ehca_kernel.h" + +/** + * Handle to be used for adress translation mechanisms, currently a placeholder. + */ +struct ehca_bridge_handle { + int handle; +}; + +inline static int ehca_adr_bad(void *adr) +{ + return (adr == 0); +}; + +#ifdef EHCA_USERDRIVER +/* userspace replacement for kernel functions */ +#include "ehca_usermain.h" +#else /* USERDRIVER */ +/* kernel includes */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct ehca_cache { + kmem_cache_t *cache; + int size; +}; + +#ifdef __powerpc64__ +#include +#include +#include +#else +#endif + +#include + +#include + + +/** + * ehca_kv_to_g - Converts a kernel virtual address to visible addresses + * (i.e. a physical/absolute address). 
+ */ +inline static u64 ehca_kv_to_g(void *adr) +{ + u64 raddr; +#ifndef CONFIG_PPC64 + raddr = virt_to_phys((u64)adr); +#else + /* we need to find not only the physical address + * but the absolute to account for memory segmentation */ + raddr = virt_to_abs((u64)adr); +#endif + if (((u64)adr & VMALLOC_START) == VMALLOC_START) { + raddr = phys_to_abs((page_to_pfn(vmalloc_to_page(adr)) << + PAGE_SHIFT)); + } + return (raddr); +} + +#endif /* USERDRIVER */ +#include + + +#endif /* _EHCA_KERNEL_H_ */ diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h new file mode 100644 index 0000000..915a0b7 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -0,0 +1,431 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * auxiliary functions + * + * Authors: Christoph Raisch + * Khadija Souissi + * Waleri Fomin + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_tools.h,v 1.43 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#ifndef EHCA_TOOLS_H +#define EHCA_TOOLS_H + +#include "ehca_flightrecorder.h" +#include "ehca_common.h" + +#define flightlog_value() mftb() + +#ifndef sizeofmember +#define sizeofmember(TYPE, MEMBER) (sizeof( ((TYPE *)0)->MEMBER)) +#endif + +#define EHCA_EDEB_TRACE_MASK_SIZE 32 +extern u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]; +#define EDEB_ID_TO_U32(str4) (str4[3] | (str4[2] << 8) | (str4[1] << 16) | \ + (str4[0] << 24)) + +inline static u64 ehca_edeb_filter(const u32 level, + const u32 id, const u32 line) +{ + u64 ret = 0; + u32 filenr = 0; + u32 filter_level = 9; + u32 dynamic_level = 0; + /* This is code written for the gcc -O2 optimizer, which should collapse + * to two single ints. filter_level is the first level kicked out by + * the compiler, which means trace everything below 6. 
*/ + if (id == EDEB_ID_TO_U32("ehav")) { + filenr = 0x01; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("clas")) { + filenr = 0x02; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("cqeq")) { + filenr = 0x03; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("shca")) { + filenr = 0x05; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("eirq")) { + filenr = 0x06; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("lMad")) { + filenr = 0x07; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mcas")) { + filenr = 0x08; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("mrmw")) { + filenr = 0x09; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("vpd ")) { + filenr = 0x0a; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("e_qp")) { + filenr = 0x0b; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("uqes")) { + filenr = 0x0c; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("PHYP")) { + filenr = 0x0d; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("snse")) { + filenr = 0x0e; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("iptz")) { + filenr = 0x0f; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("spta")) { + filenr = 0x10; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("simp")) { + filenr = 0x11; + filter_level = 8; + } + if (id == EDEB_ID_TO_U32("reqs")) { + filenr = 0x12; + filter_level = 8; + } + + if ((filenr - 1) > sizeof(ehca_edeb_mask)) { + filenr = 0; + } + + if (filenr == 0) { + filter_level = 9; + } /* default */ + ret = filenr * 0x10000 + line; + if (filter_level <= level) { + return (ret | 0x100000000); /* this is the flag to not trace */ + } + dynamic_level = ehca_edeb_mask[filenr]; + if (likely(dynamic_level <= level)) { + ret = ret | 0x100000000; + }; + return ret; +} + +#ifdef EHCA_USE_HCALL_KERNEL +#ifdef CONFIG_PPC_PSERIES + +#include + +/** + * IS_EDEB_ON - Checks if debug is on for the given level. 
+ */ +#define IS_EDEB_ON(level) \ + ((ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__) & 0x100000000)==0) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__);\ + if ((ehca_edeb_filterresult & 0x100000000) == 0) \ + printk("PU%04x %08x:%s " idstring " "format "\n", \ + get_paca()->paca_index, (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ + if (unlikely(ehca_edeb_mask[0x1e]!=0)) \ + ED_FLIGHT_LOG((((u64)(get_paca()->paca_index)<< 32) | \ + ((u64)(ehca_edeb_filterresult & 0xffffffff)) << 40 | \ + (flightlog_value()&0xffffffff)), args); \ +} while (1==0) + +#elif CONFIG_ARCH_S390 + +#include +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__);\ + if ((ehca_edeb_filterresult & 0x100000000) == 0) \ + printk("PU%04x %08x:%s " idstring " "format "\n", \ + smp_processor_id(), (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1==0) + +#elif REAL_HCALL + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + u64 ehca_edeb_filterresult = \ + ehca_edeb_filter(level, EDEB_ID_TO_U32(DEB_PREFIX), __LINE__); \ + if ((ehca_edeb_filterresult & 0x100000000) == 0) \ + printk("%08x:%s " idstring " "format "\n", \ + (u32)(ehca_edeb_filterresult), \ + __func__, ##args); \ +} while (1==0) + +#endif +#else + +#define IS_EDEB_ON(level) (1) + +#define EDEB_P_GENERIC(level,idstring,format,args...) \ +do { \ + printk("%s " idstring " "format "\n", \ + __func__, ##args); \ +} while (1==0) + +#endif + +/** + * EDEB - Trace output macro. + * @level tracelevel + * @format optional format string, use "" if not desired + * @args printf like arguments for trace, use %Lx for u64, %x for u32 + * %p for pointer + */ +#define EDEB(level,format,args...) \ + EDEB_P_GENERIC(level,"",format,##args) +#define EDEB_ERR(level,format,args...) 
\ + EDEB_P_GENERIC(level,"HCAD_ERROR ",format,##args) +#define EDEB_EN(level,format,args...) \ + EDEB_P_GENERIC(level,">>>",format,##args) +#define EDEB_EX(level,format,args...) \ + EDEB_P_GENERIC(level,"<<<",format,##args) + +/** + * EDEB macro to dump a memory block, whose length is n*8 bytes. + * Each line has the following layout: + * adr=X ofs=Y <8 bytes hex> <8 bytes hex> + */ + +#define EDEB_DMP(level,adr,len,format,args...) \ + do { \ + unsigned int x; \ + unsigned int l = (unsigned int)(len); \ + unsigned char *deb = (unsigned char*)(adr); \ + for (x = 0; x < l; x += 16) { \ + EDEB(level, format " adr=%p ofs=%04x %016lx %016lx", \ + ##args, deb, x, *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ + deb += 16; \ + } \ + } while (0) + +#define LOCATION __FILE__ " " + +/* define a bitmask, little endian version */ +#define EHCA_BMASK(pos,length) (((pos)<<16)+(length)) +/* define a bitmask, the ibm way... */ +#define EHCA_BMASK_IBM(from,to) (((63-to)<<16)+((to)-(from)+1)) +/* internal function, don't use */ +#define EHCA_BMASK_SHIFTPOS(mask) (((mask)>>16)&0xffff) +/* internal function, don't use */ +#define EHCA_BMASK_MASK(mask) (0xffffffffffffffffULL >> ((64-(mask))&0xffff)) +/* return value shifted and masked by mask\n + * variable|=EHCA_BMASK_SET(MY_MASK,0x4711) ORs the bits in variable\n + * variable&=~EHCA_BMASK_SET(MY_MASK,-1) clears the bits from the mask + * in variable + */ +#define EHCA_BMASK_SET(mask,value) \ + ((EHCA_BMASK_MASK(mask) & ((u64)(value)))<<EHCA_BMASK_SHIFTPOS(mask)) +#define EHCA_BMASK_GET(mask,value) \ + (EHCA_BMASK_MASK(mask) & (((u64)(value))>>EHCA_BMASK_SHIFTPOS(mask))) + +/** + * ehca_fixme - Dummy function which will be removed in production code + * to find all todos by compiler. 
+ */ +void ehca_fixme(void); + +extern void exit(int); +inline static void ehca_catastrophic(char *str) +{ +#ifndef EHCA_USERDRIVER + printk(KERN_ERR "HCAD_ERROR %s\n", str); + ehca_flight_to_printk(); +#else + exit(1); +#endif +} + +#define PARANOIA_MODE +#ifdef PARANOIA_MODE + +#define EHCA_CHECK_ADR_P(adr) \ + if (unlikely(adr==0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_ADR(adr) \ + if (unlikely(adr==0)) { \ + EDEB_ERR(4, "adr=%p check failed line %i", adr, \ + __LINE__); \ + return -EFAULT; } + +#define EHCA_CHECK_DEVICE_P(device) \ + if (unlikely(device==0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_DEVICE(device) \ + if (unlikely(device==0)) { \ + EDEB_ERR(4, "device=%p check failed", device); \ + return -EFAULT; } + +#define EHCA_CHECK_PD(pd) \ + if (unlikely(pd==0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return -EFAULT; } + +#define EHCA_CHECK_PD_P(pd) \ + if (unlikely(pd==0)) { \ + EDEB_ERR(4, "pd=%p check failed", pd); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_AV(av) \ + if (unlikely(av==0)) { \ + EDEB_ERR(4, "av=%p check failed", av); \ + return -EFAULT; } + +#define EHCA_CHECK_AV_P(av) \ + if (unlikely(av==0)) { \ + EDEB_ERR(4, "av=%p check failed", av); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_CQ(cq) \ + if (unlikely(cq==0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return -EFAULT; } + +#define EHCA_CHECK_CQ_P(cq) \ + if (unlikely(cq==0)) { \ + EDEB_ERR(4, "cq=%p check failed", cq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_EQ(eq) \ + if (unlikely(eq==0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return -EFAULT; } + +#define EHCA_CHECK_EQ_P(eq) \ + if (unlikely(eq==0)) { \ + EDEB_ERR(4, "eq=%p check failed", eq); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_QP(qp) \ + if (unlikely(qp==0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return 
-EFAULT; } + +#define EHCA_CHECK_QP_P(qp) \ + if (unlikely(qp==0)) { \ + EDEB_ERR(4, "qp=%p check failed", qp); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MR(mr) \ + if (unlikely(mr==0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return -EFAULT; } + +#define EHCA_CHECK_MR_P(mr) \ + if (unlikely(mr==0)) { \ + EDEB_ERR(4, "mr=%p check failed", mr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_MW(mw) \ + if (unlikely(mw==0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return -EFAULT; } + +#define EHCA_CHECK_MW_P(mw) \ + if (unlikely(mw==0)) { \ + EDEB_ERR(4, "mw=%p check failed", mw); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_CHECK_FMR(fmr) \ + if (unlikely(fmr==0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return -EFAULT; } + +#define EHCA_CHECK_FMR_P(fmr) \ + if (unlikely(fmr==0)) { \ + EDEB_ERR(4, "fmr=%p check failed", fmr); \ + return ERR_PTR(-EFAULT); } + +#define EHCA_REGISTER_PD(device,pd) +#define EHCA_REGISTER_AV(pd,av) +#define EHCA_DEREGISTER_PD(PD) +#define EHCA_DEREGISTER_AV(av) +#else +#define EHCA_CHECK_DEVICE_P(device) + +#define EHCA_CHECK_PD(pd) +#define EHCA_REGISTER_PD(device,pd) +#define EHCA_DEREGISTER_PD(PD) +#endif + +/** + * ehca2ib_return_code - Returns ib return code corresponding to the given + * ehca return code. 
+ */ +static inline int ehca2ib_return_code(u64 ehca_rc) +{ + switch (ehca_rc) { + case H_Success: + return 0; + case H_Busy: + return -EBUSY; + case H_NoMem: + return -ENOMEM; + default: + return -EINVAL; + } +} + +#endif /* EHCA_TOOLS_H */ From rolandd at cisco.com Fri Feb 17 16:57:19 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:19 -0800 Subject: [openib-general] [PATCH 06/22] Queue handling In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005719.13620.95136.stgit@localhost.localdomain> From: Roland Dreier Code like #ifndef __PPC64__ void * dummy1; /* make sure we use the same thing on 32 bit */ #endif looks _very_ suspicious. Much better to make sure that the structures are laid out the same no matter what the word size of the architecture is rather than relying on fragile hacks like this. --- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 137 ++++++++++++++++++++++ drivers/infiniband/hw/ehca/ipz_pt_fn.h | 165 +++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ipz_pt_fn_core.h | 152 +++++++++++++++++++++++++ 3 files changed, 454 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c new file mode 100644 index 0000000..d6c490c --- /dev/null +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c @@ -0,0 +1,137 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ipz_pt_fn.c,v 1.16 2006/02/06 10:17:34 schickhj Exp $ + */ + +#define DEB_PREFIX "iptz" + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ipz_pt_fn.h" + +extern int ehca_hwlevel; + +void *ipz_QPageit_get_inc(struct ipz_queue *queue) +{ + void *retvalue = NULL; + u8 *EOF_last_page = queue->queue + queue->queue_length; + + retvalue = queue->current_q_addr; + queue->current_q_addr += queue->pagesize; + if (queue->current_q_addr > EOF_last_page) { + queue->current_q_addr -= queue->pagesize; + retvalue = NULL; + } + + if ((((u64)retvalue) % EHCA_PAGESIZE) != 0) { + EDEB(4, "ERROR!! 
not at PAGE-Boundary"); + return (NULL); + } + EDEB(7, "queue=%p retvalue=%p", queue, retvalue); + return (retvalue); +} + +void *ipz_QEit_EQ_get_inc(struct ipz_queue *queue) +{ + void *retvalue = NULL; + u8 *last_entry_in_q = queue->queue + queue->queue_length + - queue->qe_size; + + retvalue = queue->current_q_addr; + queue->current_q_addr += queue->qe_size; + if (queue->current_q_addr > last_entry_in_q) { + queue->current_q_addr = queue->queue; + queue->toggle_state = (~queue->toggle_state) & 1; + } + + EDEB(7, "queue=%p retvalue=%p new current_q_addr=%p qe_size=%x", + queue, retvalue, queue->current_q_addr, queue->qe_size); + + return (retvalue); +} + +int ipz_queue_ctor(struct ipz_queue *queue, + const u32 nr_of_pages, + const u32 pagesize, const u32 qe_size, const u32 nr_of_sg) +{ + EDEB_EN(7, "nr_of_pages=%x pagesize=%x qe_size=%x", + nr_of_pages, pagesize, qe_size); + queue->queue_length = nr_of_pages * pagesize; + queue->queue = vmalloc(queue->queue_length); + if (queue->queue == 0) { + EDEB(4, "ERROR!! didn't get the memory"); + return (FALSE); + } + if ((((u64)queue->queue) & (EHCA_PAGESIZE - 1)) != 0) { + EDEB(4, "ERROR!! 
QUEUE doesn't start at " + "page boundary"); + vfree(queue->queue); + return (FALSE); + } + + memset(queue->queue, 0, queue->queue_length); + queue->current_q_addr = queue->queue; + queue->qe_size = qe_size; + queue->act_nr_of_sg = nr_of_sg; + queue->pagesize = pagesize; + queue->toggle_state = 1; + EDEB_EX(7, "queue_length=%x queue=%p qe_size=%x" + " act_nr_of_sg=%x", queue->queue_length, queue->queue, + queue->qe_size, queue->act_nr_of_sg); + return TRUE; +} + +int ipz_queue_dtor(struct ipz_queue *queue) +{ + EDEB_EN(7, "ipz_queue pointer=%p", queue); + if (queue == NULL) { + return (FALSE); + } + if (queue->queue == NULL) { + return (FALSE); + } + EDEB(7, "destructing a queue with the following " + "properties:\n act_nr_of_sg=%x pagesize=%x qe_size=%x", + queue->act_nr_of_sg, queue->pagesize, queue->qe_size); + vfree(queue->queue); + + EDEB_EX(7, "queue freed!"); + return TRUE; +} diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h new file mode 100644 index 0000000..2e197db --- /dev/null +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -0,0 +1,165 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ipz_pt_fn.h,v 1.11 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __IPZ_PT_FN_H__ +#define __IPZ_PT_FN_H__ + +#include "ipz_pt_fn_core.h" +#include "ehca_qes.h" + +#define EHCA_PAGESIZE 4096UL +#define EHCA_PT_ENTRIES 512UL + +/** @brief generic page table + */ +struct ipz_pt { + u64 entries[EHCA_PT_ENTRIES]; +}; + +/** @brief generic page + */ +struct ipz_page { + u8 entries[EHCA_PAGESIZE]; +}; + +/** @brief page table for a queue, only to be used in pf + */ +struct ipz_qpt { + /* queue page tables (kv), use u64 because we know the element length */ + u64 *qpts; + u32 allocated_qpts_entries; + u32 nr_of_PTEs; /* number of page table entries PTE iterators */ + u64 *current_pte_addr; +}; + +/** @brief constructor for a ipz_queue_t, placement new for ipz_queue_t, + new for all dependent datastructors + + all QP Tables are the same + flow: + -# allocate+pin queue + @see ipz_qpt_ctor() + @returns true if ok, false if out of memory + */ +int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages, + const u32 pagesize, + const u32 qe_size, /* queue entry size*/ + const u32 nr_of_sg); + +/** @brief destructor for a ipz_queue_t + -# free queue + @see 
ipz_queue_ctor() + @returns true if ok, false if queue was NULL-ptr of free failed +*/ +int ipz_queue_dtor(struct ipz_queue *queue); + +/** @brief constructor for a ipz_qpt_t, + * placement new for struct ipz_queue, new for all dependent datastructors + * + * all QP Tables are the same, + * flow: + * -# allocate+pin queue + * -# initialise ptcb + * -# allocate+pin PTs + * -# link PTs to a ring, according to HCA Arch, set bit62 id needed + * -# the ring must have room for exactly nr_of_PTEs + * @see ipz_qpt_ctor() + */ +void ipz_qpt_ctor(struct ipz_qpt *qpt, + struct ehca_bridge_handle bridge, + const u32 nr_of_QEs, + const u32 pagesize, + const u32 qe_size, + const u8 lowbyte, const u8 toggle, + u32 * act_nr_of_QEs, + u32 * act_nr_of_pages); + +/** @brief return current Queue Entry, increment Queue Entry iterator by one + step in struct ipz_queue, will wrap in ringbuffer + @returns address (kv) of Queue Entry BEFORE increment + @warning don't use in parallel with ipz_QPageit_get_inc() + @warning unpredictable results may occur if steps>act_nr_of_queue_entries + + fix EQ page problems + */ +void *ipz_QEit_EQ_get_inc(struct ipz_queue *queue); + +/** @brief return current Event Queue Entry, increment Queue Entry iterator + by one step in struct ipz_queue if valid, will wrap in ringbuffer + @returns address (kv) of Queue Entry BEFORE increment + @returns 0 and does not increment, if wrong valid state + @warning don't use in parallel with ipz_queue_QPageit_get_inc() + @warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +inline static void *ipz_QEit_EQ_get_inc_valid(struct ipz_queue *queue) +{ + void *retvalue = ipz_QEit_get(queue); + u32 qe = *(u8 *) retvalue; + EDEB(7, "ipz_QEit_EQ_get_inc_valid qe=%x", qe); + if ((qe >> 7) == (queue->toggle_state & 1)) { + /* this is a good one */ + ipz_QEit_EQ_get_inc(queue); + } else { + retvalue = NULL; + } + return (retvalue); +} + +/** + @returns address (GX) of first queue entry + */ +inline static u64 
ipz_qpt_get_firstpage(struct ipz_qpt *qpt) +{ + return (be64_to_cpu(qpt->qpts[0])); +} + +/** + @returns address (kv) of first page of queue page table + */ +inline static void *ipz_qpt_get_qpt(struct ipz_qpt *qpt) +{ + return (qpt->qpts); +} + +#endif /* __IPZ_PT_FN_H__ */ diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn_core.h b/drivers/infiniband/hw/ehca/ipz_pt_fn_core.h new file mode 100644 index 0000000..1b9a114 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn_core.h @@ -0,0 +1,152 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * internal queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ipz_pt_fn_core.h,v 1.12 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __IPZ_PT_FN_CORE_H__ +#define __IPZ_PT_FN_CORE_H__ + +#ifdef __KERNEL__ +#include "ehca_tools.h" +#else /* some replacements for kernel stuff */ +#include "ehca_utools.h" +#endif + +#include "ehca_qes.h" + +/** @brief generic queue in linux kernel virtual memory (kv) + */ +struct ipz_queue { +#ifndef __PPC64__ + void * dummy1; /* make sure we use the same thing on 32 bit */ +#endif + u8 *current_q_addr; /* current queue entry */ +#ifndef __PPC64__ + void * dummy2; +#endif + u8 *queue; /* points to first queue entry */ + u32 qe_size; /* queue entry size */ + u32 act_nr_of_sg; + u32 queue_length; /* queue length allocated in bytes */ + u32 pagesize; + u32 toggle_state; /* toggle flag - per page */ + u32 dummy3; /* 64 bit alignment*/ +}; + +/** @brief return current Queue Entry + @returns address (kv) of Queue Entry + */ +static inline void *ipz_QEit_get(struct ipz_queue *queue) +{ + return (queue->current_q_addr); +} + +/** @brief return current Queue Page , increment Queue Page iterator from + page to page in struct ipz_queue, last increment will return 0! 
and + NOT wrap + @returns address (kv) of Queue Page + @warning don't use in parallel with ipz_QE_get_inc() + */ +void *ipz_QPageit_get_inc(struct ipz_queue *queue); + +/** @brief return current Queue Entry, increment Queue Entry iterator by one + step in struct ipz_queue, will wrap in ringbuffer + @returns address (kv) of Queue Entry BEFORE increment + @warning don't use in parallel with ipz_QPageit_get_inc() + @warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +static inline void *ipz_QEit_get_inc(struct ipz_queue *queue) +{ + void *retvalue = 0; + u8 *last_entry_in_q = queue->queue + queue->queue_length + - queue->qe_size; + + retvalue = queue->current_q_addr; + queue->current_q_addr += queue->qe_size; + if (queue->current_q_addr > last_entry_in_q) { + queue->current_q_addr = queue->queue; + /* toggle the valid flag */ + queue->toggle_state = (~queue->toggle_state) & 1; + } + + EDEB(7, "queue=%p retvalue=%p new current_q_addr=%p qe_size=%x", + queue, retvalue, queue->current_q_addr, queue->qe_size); + + return (retvalue); +} + +/** @brief return current Queue Entry, increment Queue Entry iterator by one + step in struct ipz_queue, will wrap in ringbuffer + @returns address (kv) of Queue Entry BEFORE increment + @returns 0 and does not increment, if wrong valid state + @warning don't use in parallel with ipz_QPageit_get_inc() + @warning unpredictable results may occur if steps>act_nr_of_queue_entries + */ +inline static void *ipz_QEit_get_inc_valid(struct ipz_queue *queue) +{ + void *retvalue = ipz_QEit_get(queue); +#ifdef USERSPACE_DRIVER + + u32 qe = + ((struct ehca_cqe *)(ehca_ktou((struct ehca_cqe *)retvalue)))-> + cqe_flags; +#else + u32 qe = ((struct ehca_cqe *)retvalue)->cqe_flags; +#endif + if ((qe >> 7) == (queue->toggle_state & 1)) { + /* this is a good one */ + ipz_QEit_get_inc(queue); + } else + retvalue = 0; + return (retvalue); +} + +/** @brief returns and resets Queue Entry iterator + @returns address (kv) of first Queue 
Entry + */ +static inline void *ipz_QEit_reset(struct ipz_queue *queue) +{ + queue->current_q_addr = queue->queue; + return (queue->queue); +} + +#endif /* __IPZ_PT_FN_CORE_H__ */ From rolandd at cisco.com Fri Feb 17 16:57:39 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:39 -0800 Subject: [openib-general] [PATCH 12/22] ehca low-level verbs In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005739.13620.15633.stgit@localhost.localdomain> From: Roland Dreier What is ehca_init_module()? It is declared but never defined. --- drivers/infiniband/hw/ehca/ehca_iverbs.h | 163 ++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_qes.h | 274 ++++++++++++++++++++++++++++++ 2 files changed, 437 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h new file mode 100644 index 0000000..b1319a9 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -0,0 +1,163 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Function definitions for internal functions + * + * Authors: Heiko J Schick + * Khadija Souissi + * Christoph Raisch + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_iverbs.h,v 1.32 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef __EHCA_IVERBS_H__ +#define __EHCA_IVERBS_H__ + +#include "ehca_classes.h" +/** ehca internal verb for testuse + */ +void ehca_init_module(void); + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props); +int ehca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props); +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey); +int ehca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid); +int ehca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props); + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, + struct ib_udata *udata); + +int ehca_dealloc_pd(struct ib_pd *pd); + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); +int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); +int ehca_destroy_ah(struct ib_ah *ah); + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + struct ib_udata *udata); +int ehca_resize_cq(struct ib_cq *cq, int cqe); + +int ehca_destroy_cq(struct ib_cq *cq); + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc); + +int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata); + +u64 ehca_define_sqp(struct ehca_shca *shca, struct ehca_qp *ibqp, + struct ib_qp_init_attr *qp_init_attr); + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); + +int ehca_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); + +int ehca_destroy_qp(struct ib_qp *qp); + +int 
ehca_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr); + +int ehca_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, u64 *iova_start); + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, struct ib_udata *udata); + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, int mr_access_flags, u64 *iova_start); + +int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +int ehca_dereg_mr(struct ib_mr *mr); + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd); + +int ehca_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, struct ib_mw_bind *mw_bind); + +int ehca_dealloc_mw(struct ib_mw *mw); + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, u64 iova); + +int ehca_unmap_fmr(struct list_head *fmr_list); + +int ehca_dealloc_fmr(struct ib_fmr *fmr); + +int ehca_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +int ehca_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata); +int ehca_dealloc_ucontext(struct ib_ucontext *context); + +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + +int ehca_poll_eqs(void *data); + +int ehca_mmap_nopage(u64 foffset,u64 length,void ** mapped,struct vm_area_struct ** vma); +int ehca_mmap_register(u64 physical,void ** mapped,struct vm_area_struct ** vma); +int ehca_munmap(unsigned long addr, size_t len); + +#endif diff --git 
a/drivers/infiniband/hw/ehca/ehca_qes.h b/drivers/infiniband/hw/ehca/ehca_qes.h new file mode 100644 index 0000000..e9420e3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_qes.h @@ -0,0 +1,274 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Hardware request structures + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_qes.h,v 1.9 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#ifndef _EHCA_QES_H_ +#define _EHCA_QES_H_ + +/** DON'T include any kernel related files here!!! + * This file is used commonly in user and kernel space!!! + */ + +/** + * virtual scatter gather entry to specify remote adresses with length + */ +struct ehca_vsgentry { + u64 vaddr; + u32 lkey; + u32 length; +}; + +#define GRH_FLAG_MASK EHCA_BMASK_IBM(7,7) +#define GRH_IPVERSION_MASK EHCA_BMASK_IBM(0,3) +#define GRH_TCLASS_MASK EHCA_BMASK_IBM(4,12) +#define GRH_FLOWLABEL_MASK EHCA_BMASK_IBM(13,31) +#define GRH_PAYLEN_MASK EHCA_BMASK_IBM(32,47) +#define GRH_NEXTHEADER_MASK EHCA_BMASK_IBM(48,55) +#define GRH_HOPLIMIT_MASK EHCA_BMASK_IBM(56,63) + +/** + * Unreliable Datagram Address Vector Format + * see IBTA Vol1 chapter 8.3 Global Routing Header + */ +struct ehca_ud_av { + u8 sl; + u8 lnh; + u16 dlid; + u8 reserved1; + u8 reserved2; + u8 reserved3; + u8 slid_path_bits; + u8 reserved4; + u8 ipd; + u8 reserved5; + u8 pmtu; + u32 reserved6; + u64 reserved7; + union { + struct { + u64 word_0; /* always set to 6 */ + /*should be 0x1B for IB transport */ + u64 word_1; + u64 word_2; + u64 word_3; + u64 word_4; + } grh; + struct { + u32 wd_0; + u32 wd_1; + /* DWord_1 --> SGID */ + + u32 sgid_wd3; + /* bits 127 - 96 */ + + u32 sgid_wd2; + /* bits 95 - 64 */ + /* DWord_2 */ + + u32 sgid_wd1; + /* bits 63 - 32 */ + + u32 sgid_wd0; + /* bits 31 - 0 */ + /* DWord_3 --> DGID */ + + u32 dgid_wd3; + /* bits 127 - 96 + **/ + u32 dgid_wd2; + /* bits 95 - 64 + DWord_4 */ + u32 dgid_wd1; + /* bits 63 - 32 */ + + u32 dgid_wd0; + /* bits 31 - 0 */ + } grh_l; + }; +}; + +/* maximum number of sg entries allowed in a WQE */ +#define MAX_WQE_SG_ENTRIES 252 + +#define WQE_OPTYPE_SEND 0x80 +#define WQE_OPTYPE_RDMAREAD 0x40 +#define WQE_OPTYPE_RDMAWRITE 0x20 +#define WQE_OPTYPE_CMPSWAP 0x10 +#define WQE_OPTYPE_FETCHADD 0x08 +#define WQE_OPTYPE_BIND 0x04 + +#define WQE_WRFLAG_REQ_SIGNAL_COM 0x80 +#define WQE_WRFLAG_FENCE 
0x40 +#define WQE_WRFLAG_IMM_DATA_PRESENT 0x20 +#define WQE_WRFLAG_SOLIC_EVENT 0x10 + +#define WQEF_CACHE_HINT 0x80 +#define WQEF_CACHE_HINT_RD_WR 0x40 +#define WQEF_TIMED_WQE 0x20 +#define WQEF_PURGE 0x08 + +#define MW_BIND_ACCESSCTRL_R_WRITE 0x40 +#define MW_BIND_ACCESSCTRL_R_READ 0x20 +#define MW_BIND_ACCESSCTRL_R_ATOMIC 0x10 + +struct ehca_wqe { + u64 work_request_id; + u8 optype; + u8 wr_flag; + u16 pkeyi; + u8 wqef; + u8 nr_of_data_seg; + u16 wqe_provided_slid; + u32 destination_qp_number; + u32 resync_psn_sqp; + u32 local_ee_context_qkey; + u32 immediate_data; + union { + struct { + u64 remote_virtual_adress; + u32 rkey; + u32 reserved; + u64 atomic_1st_op_dma_len; + u64 atomic_2nd_op; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + + } nud; + struct { + u64 ehca_ud_av_ptr; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } ud_avp; + struct { + struct ehca_ud_av ud_av; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES - + 2]; + } ud_av; + struct { + u64 reserved0; + u64 reserved1; + u64 reserved2; + u64 reserved3; + struct ehca_vsgentry sg_list[MAX_WQE_SG_ENTRIES]; + } all_rcv; + + struct { + u64 reserved; + u32 rkey; + u32 old_rkey; + u64 reserved1; + u64 reserved2; + u64 virtual_address; + u32 reserved3; + u32 length; + u32 reserved4; + u16 reserved5; + u8 reserved6; + u8 lr_ctl; + u32 lkey; + u32 reserved7; + u64 reserved8; + u64 reserved9; + u64 reserved10; + u64 reserved11; + } bind; + struct { + u64 reserved12; + u64 reserved13; + u32 size; + u32 start; + } inline_data; + } u; + +}; + +#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0,0) +#define WC_IMM_DATA EHCA_BMASK_IBM(1,1) +#define WC_GRH_PRESENT EHCA_BMASK_IBM(2,2) +#define WC_SE_BIT EHCA_BMASK_IBM(3,3) + +struct ehca_cqe { + u64 work_request_id; + u8 optype; + u8 w_completion_flags; + u16 reserved1; + u32 nr_bytes_transferred; + u32 immediate_data; + u32 local_qp_number; + u8 freed_resource_count; + u8 service_level; + u16 wqe_count; + u32 
qp_token; + u32 qkey_ee_token; + u32 remote_qp_number; + u16 dlid; + u16 rlid; + u16 reserved2; + u16 pkey_index; + u32 cqe_timestamp; + u32 wqe_timestamp; + u8 wqe_timestamp_valid; + u8 reserved3; + u8 reserved4; + u8 cqe_flags; + u32 status; +}; + +struct ehca_eqe { + u64 entry; +}; + +struct ehca_mrte { + u64 starting_va; + u64 length; /* length of memory region in bytes*/ + u32 pd; + u8 key_instance; + u8 pagesize; + u8 mr_control; + u8 local_remote_access_ctrl; + u8 reserved[0x20 - 0x18]; + u64 at_pointer[4]; +}; +#endif /*_EHCA_QES_H_*/ From rolandd at cisco.com Fri Feb 17 16:57:37 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:37 -0800 Subject: [openib-general] [PATCH 11/22] ehca event queues In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005730.13620.53494.stgit@localhost.localdomain> From: Roland Dreier In ehca_poll_eqs(), is there any reason not to use list_for_each_entry()? Since ehca_poll_eqs() defers all the work to a workqueue, is there any reason for it to run in a kernel thread? Why not just make it a recurring timer? --- drivers/infiniband/hw/ehca/ehca_eq.c | 242 ++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_eq.h | 78 +++++++++++ 2 files changed, 320 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c new file mode 100644 index 0000000..e508edb --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_eq.c @@ -0,0 +1,242 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Event queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD.
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_eq.c,v 1.40 2006/02/06 16:20:38 schickhj Exp $ + */ + +#define DEB_PREFIX "e_eq" + +#include "ehca_eq.h" +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "hcp_if.h" +#include "ehca_iverbs.h" +#include "ipz_pt_fn.h" +#include "ehca_qes.h" +#include "ehca_irq.h" + +/* TODO: should be defined in ehca_classes_pSeries.h */ +#define HIPZ_EQ_REGISTER_ORIG 0 + +int ehca_create_eq(struct ehca_shca *shca, + struct ehca_eq *eq, + const enum ehca_eq_type type, const u32 length) +{ + extern struct workqueue_struct *ehca_wq; + u64 ret = H_Success; + u32 nr_pages = 0; + u32 i; + void *vpage = NULL; + + EDEB_EN(7, "shca=%p eq=%p length=%x", shca, eq, length); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_ADR(eq); + + spin_lock_init(&eq->spinlock); + eq->is_initialized = 0; + + if (type!=EHCA_EQ && type!=EHCA_NEQ) { + EDEB_ERR(4, "Invalid EQ type %x. eq=%p", type, eq); + return -EINVAL; + } + if (length==0) { + EDEB_ERR(4, "EQ length must not be zero. eq=%p", eq); + return -EINVAL; + } + + ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle, + &eq->pf, + type, + length, + &eq->ipz_eq_handle, + &eq->length, + &nr_pages, &eq->irq_info.ist); + + if (ret != H_Success) { + EDEB_ERR(4, "Can't allocate EQ / NEQ. eq=%p", eq); + return -EINVAL; + } + + ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages, + EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0); + if (!ret) { + EDEB_ERR(4, "Can't allocate EQ pages. 
eq=%p", eq); + goto create_eq_exit1; + } + + for (i = 0; i < nr_pages; i++) { + u64 rpage; + + if (!(vpage = ipz_QPageit_get_inc(&eq->ipz_queue))) { + ret = H_Resource; + goto create_eq_exit2; + } + + rpage = ehca_kv_to_g(vpage); + ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle, + eq->ipz_eq_handle, + &eq->pf, + 0, + HIPZ_EQ_REGISTER_ORIG, rpage, 1); + + if (i == (nr_pages - 1)) { + /* last page */ + vpage = ipz_QPageit_get_inc(&eq->ipz_queue); + if ((ret != H_Success) || (vpage != 0)) { + goto create_eq_exit2; + } + } else { + if ((ret != H_PAGE_REGISTERED) || (vpage == 0)) { + goto create_eq_exit2; + } + } + } + + ipz_QEit_reset(&eq->ipz_queue); + +#ifndef EHCA_USERDRIVER + { + pid_t pid = 0; + (eq->irq_info).pid = pid; + (eq->irq_info).eq = eq; + (eq->irq_info).wq = ehca_wq; + (eq->irq_info).work = &(eq->work); + } +#endif + + /* register interrupt handlers and initialize work queues */ + if (type == EHCA_EQ) { + INIT_WORK(&(eq->work), + ehca_interrupt_eq, (void *)&(eq->irq_info)); + eq->is_initialized = 1; + hipz_request_interrupt(&(eq->irq_info), ehca_interrupt); + } else if (type == EHCA_NEQ) { + INIT_WORK(&(eq->work), + ehca_interrupt_neq, (void *)&(eq->irq_info)); + hipz_request_interrupt(&(eq->irq_info), ehca_interrupt); + } + + EDEB_EX(7, "ret=%lx", ret); + + return 0; + + create_eq_exit2: + ipz_queue_dtor(&eq->ipz_queue); + + create_eq_exit1: + hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + EDEB_EX(7, "ret=%lx", ret); + + return -EINVAL; +} + +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + void *eqe = NULL; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR_P(shca); + EHCA_CHECK_EQ_P(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + eqe = ipz_QEit_EQ_get_inc_valid(&eq->ipz_queue); + spin_unlock_irqrestore(&eq->spinlock, flags); + + EDEB_EX(7, "eq=%p eqe=%p", eq, eqe); + + return eqe; +} + +int ehca_poll_eqs(void *data) +{ + extern struct workqueue_struct *ehca_wq; + struct ehca_shca *shca; 
+ struct ehca_module* module = data; + struct list_head *entry; + + do { + spin_lock(&module->shca_lock); + list_for_each(entry, &module->shca_list) { + shca = list_entry(entry, struct ehca_shca, shca_list); + + if (shca->eq.is_initialized && !kthread_should_stop()) + queue_work(ehca_wq, &shca->eq.work); + } + spin_unlock(&module->shca_lock); + + msleep_interruptible(1000); + } + while(!kthread_should_stop()); + + return 0; +} + +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) +{ + unsigned long flags = 0; + u64 retcode = H_Success; + + EDEB_EN(7, "shca=%p eq=%p", shca, eq); + EHCA_CHECK_ADR(shca); + EHCA_CHECK_EQ(eq); + + spin_lock_irqsave(&eq->spinlock, flags); + hipz_free_interrupt(&(eq->irq_info)); + + retcode = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); + + spin_unlock_irqrestore(&eq->spinlock, flags); + + if (retcode != H_Success) { + EDEB_ERR(4, "Can't free EQ resources."); + return -EINVAL; + } + ipz_queue_dtor(&eq->ipz_queue); + + EDEB_EX(7, "retcode=%lx", retcode); + + return 0; +} + diff --git a/drivers/infiniband/hw/ehca/ehca_eq.h b/drivers/infiniband/hw/ehca/ehca_eq.h new file mode 100644 index 0000000..d09f21b --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_eq.h @@ -0,0 +1,78 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Completion queue, event queue handling helper functions + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_eq.h,v 1.10 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef EHCA_EQ_H +#define EHCA_EQ_H + +#include "ehca_classes.h" +#include "ehca_common.h" + +enum ehca_eq_type { + EHCA_EQ = 0, /* event queue */ + EHCA_NEQ /* notification event queue */ +}; + +/** @brief hcad internal create EQ + */ +int ehca_create_eq(struct ehca_shca *shca, + struct ehca_eq *eq, /* struct contains eq to create */ + enum ehca_eq_type type, + const u32 length); + +/** @brief destroy the eq + */ +int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); + +/** @brief hcad internal poll EQ + - check if new EQE available, + - if yes, increment EQE pointer + - otherwise return 0 + @returns pointer to EQE if new valid EQE available + */ +void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq); + +#endif /* EHCA_EQ_H */ + From rolandd at cisco.com Fri Feb 17 16:57:41 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:41 -0800 Subject: [openib-general] [PATCH 13/22] HCA query functions In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005741.13620.93906.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_hca.c | 321 +++++++++++++++++++++++++++++++++ 1 files changed, 321 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c new file mode 100644 index 0000000..af05a5c --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -0,0 +1,321 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * HCA query functions + * + * Authors: Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD.
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_hca.c,v 1.46 2006/02/06 10:17:34 schickhj Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "shca" + +#include "ehca_kernel.h" +#include "ehca_tools.h" + +#include "hcp_if.h" /* TODO: later via hipz_* header file */ + +#define TO_MAX_INT(dest, src) \ + if (src >= INT_MAX) \ + dest = INT_MAX; \ + else \ + dest = src + +int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct query_hca_rblock *rblock; + + EDEB_EN(7, ""); + EHCA_CHECK_DEVICE(ibdev); + + memset(props, 0, sizeof(struct ib_device_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_device0; + } + + memset(rblock, 0, PAGE_SIZE); + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_Success) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto query_device1; + } + props->fw_ver = rblock->hw_ver; + /* TODO: memcpy(&props->sys_image_guid, ...); */ + props->max_mr_size = rblock->max_mr_size; + /* TODO: props->page_size_cap */ + props->vendor_id = rblock->vendor_id >> 8; + props->vendor_part_id = rblock->vendor_part_id >> 16; + props->hw_ver = rblock->hw_ver; + TO_MAX_INT(props->max_qp, (rblock->max_qp - rblock->cur_qp)); + /* TODO: props->max_qp_wr = */ + /* TODO: props->device_cap_flags */ + props->max_sge = rblock->max_sge; + props->max_sge_rd = rblock->max_sge_rd; + TO_MAX_INT(props->max_cq, (rblock->max_cq - rblock->cur_cq)); + props->max_cqe = rblock->max_cqe; + TO_MAX_INT(props->max_mr, (rblock->max_mr - rblock->cur_mr)); + TO_MAX_INT(props->max_pd, rblock->max_pd); + /* TODO: props->max_qp_rd_atom */ + /* TODO: props->max_qp_init_rd_atom */ + /* TODO: props->atomic_cap */ + /* TODO: props->max_ee */ + /* TODO: props->max_rdd */ + props->max_mw = rblock->max_mw; + TO_MAX_INT(props->max_mw, (rblock->max_mw - rblock->cur_mw)); +
props->max_raw_ipv6_qp = rblock->max_raw_ipv6_qp; + props->max_raw_ethy_qp = rblock->max_raw_ethy_qp; + props->max_mcast_grp = rblock->max_mcast_grp; + props->max_mcast_qp_attach = rblock->max_qps_attached_mcast_grp; + props->max_total_mcast_qp_attach = rblock->max_qps_attached_all_mcast_grp; + + TO_MAX_INT(props->max_ah, rblock->max_ah); + + props->max_fmr = rblock->max_mr; + /* TODO: props->max_map_per_fmr */ + + /* TODO: props->max_srq */ + /* TODO: props->max_srq_wr */ + /* TODO: props->max_srq_sge */ + props->max_srq = 0; + props->max_srq_wr = 0; + props->max_srq_sge = 0; + + /* TODO: props->max_pkeys */ + props->max_pkeys = 16; + + props->local_ca_ack_delay = rblock->local_ca_ack_delay; + + query_device1: + kfree(rblock); + + query_device0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + int ret = 0; + struct ehca_shca *shca; + struct query_port_rblock *rblock; + + EDEB_EN(7, "port=%x", port); + EHCA_CHECK_DEVICE(ibdev); + + memset(props, 0, sizeof(struct ib_port_attr)); + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_port0; + } + + memset(rblock, 0, PAGE_SIZE); + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_Success) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_port1; + } + + props->state = rblock->state; + + switch (rblock->max_mtu) { + case 0x1: + props->active_mtu = props->max_mtu = IB_MTU_256; + break; + case 0x2: + props->active_mtu = props->max_mtu = IB_MTU_512; + break; + case 0x3: + props->active_mtu = props->max_mtu = IB_MTU_1024; + break; + case 0x4: + props->active_mtu = props->max_mtu = IB_MTU_2048; + break; + case 0x5: + props->active_mtu = props->max_mtu = IB_MTU_4096; + break; + default: + EDEB_ERR(4, "Unknown MTU size: %x.", rblock->max_mtu); + } 
+ + props->gid_tbl_len = rblock->gid_tbl_len; + /* TODO: props->port_cap_flags */ + props->max_msg_sz = rblock->max_msg_sz; + props->bad_pkey_cntr = rblock->bad_pkey_cntr; + props->qkey_viol_cntr = rblock->qkey_viol_cntr; + props->pkey_tbl_len = rblock->pkey_tbl_len; + props->lid = rblock->lid; + props->sm_lid = rblock->sm_lid; + props->lmc = rblock->lmc; + /* TODO: max_vl_num */ + props->sm_sl = rblock->sm_sl; + props->subnet_timeout = rblock->subnet_timeout; + props->init_type_reply = rblock->init_type_reply; + + /* TODO: props->active_width */ + props->active_width = IB_WIDTH_12X; + /* TODO: props->active_speed */ + + /* TODO: props->phys_state */ + + query_port1: + kfree(rblock); + + query_port0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) +{ + int ret = 0; + struct ehca_shca *shca; + struct query_port_rblock *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + EHCA_CHECK_DEVICE(ibdev); + + if (index > 16) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = -EINVAL; + goto query_pkey0; + } + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_pkey0; + } + + memset(rblock, 0, PAGE_SIZE); + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_Success) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_pkey1; + } + + memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); + + query_pkey1: + kfree(rblock); + + query_pkey0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +int ehca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + int ret = 0; + struct ehca_shca *shca; + struct query_port_rblock *rblock; + + EDEB_EN(7, "port=%x index=%x", port, index); + EHCA_CHECK_DEVICE(ibdev); + + if (index > 255) { + EDEB_ERR(4, "Invalid index: %x.", index); + ret = 
-EINVAL; + goto query_gid0; + } + + shca = container_of(ibdev, struct ehca_shca, ib_device); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto query_gid0; + } + + memset(rblock, 0, PAGE_SIZE); + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_Success) { + EDEB_ERR(4, "Can't query port properties"); + ret = -EINVAL; + goto query_gid1; + } + + memcpy(&gid->raw[0], &rblock->gid_prefix, sizeof(u64)); + memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); + + query_gid1: + kfree(rblock); + + query_gid0: + EDEB_EX(7, "ret=%x GID=%lx%lx", ret, + *(u64 *) & gid->raw[0], + *(u64 *) & gid->raw[8]); + + return ret; +} + +int ehca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + int ret = 0; + + EDEB_EN(7, "port=%x", port); + EHCA_CHECK_DEVICE(ibdev); + + /* TODO: implementation */ + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} From rolandd at cisco.com Fri Feb 17 16:57:45 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:45 -0800 Subject: [openib-general] [PATCH 15/22] ehca queue pair handling In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005745.13620.43256.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_qp.c | 1528 ++++++++++++++++++++++++++++++++++ 1 files changed, 1528 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c new file mode 100644 index 0000000..e5b1b80 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -0,0 +1,1528 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * QP functions + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation 
+ * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_qp.c,v 1.159 2006/02/15 15:01:24 nguyen Exp $ + */ + + +#define DEB_PREFIX "e_qp" + +#include "ehca_kernel.h" + +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "ehca_qes.h" + +#include "ehca_iverbs.h" +#include +#include + +#include +#include + +/** @brief attributes not supported by query qp + */ +#define QP_ATTR_QUERY_NOT_SUPPORTED (IB_QP_MAX_DEST_RD_ATOMIC | \ + IB_QP_MAX_QP_RD_ATOMIC | \ + IB_QP_ACCESS_FLAGS | \ + IB_QP_EN_SQD_ASYNC_NOTIFY) + +/** @brief ehca (internal) qp state values + */ +enum ehca_qp_state { + EHCA_QPS_RESET = 1, + EHCA_QPS_INIT = 2, + EHCA_QPS_RTR = 3, + EHCA_QPS_RTS = 5, + EHCA_QPS_SQD = 6, + EHCA_QPS_SQE = 8, + EHCA_QPS_ERR = 128 +}; + +/** @brief qp state transitions as defined by IB Arch Rel 1.1 page 431 + */ +enum ib_qp_statetrans { + IB_QPST_ANY2RESET, + IB_QPST_ANY2ERR, + IB_QPST_RESET2INIT, + IB_QPST_INIT2RTR, + IB_QPST_INIT2INIT, + IB_QPST_RTR2RTS, + IB_QPST_RTS2SQD, + IB_QPST_RTS2RTS, + IB_QPST_SQD2RTS, + IB_QPST_SQE2RTS, + IB_QPST_SQD2SQD, + IB_QPST_MAX /* nr of transitions, this must be last!!! 
*/ +}; + +/** @brief returns ehca qp state corresponding to given ib qp state + */ +static inline enum ehca_qp_state ib2ehca_qp_state(enum ib_qp_state ib_qp_state) +{ + switch (ib_qp_state) { + case IB_QPS_RESET: + return EHCA_QPS_RESET; + case IB_QPS_INIT: + return EHCA_QPS_INIT; + case IB_QPS_RTR: + return EHCA_QPS_RTR; + case IB_QPS_RTS: + return EHCA_QPS_RTS; + case IB_QPS_SQD: + return EHCA_QPS_SQD; + case IB_QPS_SQE: + return EHCA_QPS_SQE; + case IB_QPS_ERR: + return EHCA_QPS_ERR; + default: + EDEB_ERR(4, "invalid ib_qp_state=%x", ib_qp_state); + return -EINVAL; + } +} + +/** @brief returns ib qp state corresponding to given ehca qp state + */ +static inline enum ib_qp_state ehca2ib_qp_state(enum ehca_qp_state + ehca_qp_state) +{ + switch (ehca_qp_state) { + case EHCA_QPS_RESET: + return IB_QPS_RESET; + case EHCA_QPS_INIT: + return IB_QPS_INIT; + case EHCA_QPS_RTR: + return IB_QPS_RTR; + case EHCA_QPS_RTS: + return IB_QPS_RTS; + case EHCA_QPS_SQD: + return IB_QPS_SQD; + case EHCA_QPS_SQE: + return IB_QPS_SQE; + case EHCA_QPS_ERR: + return IB_QPS_ERR; + default: + EDEB_ERR(4,"invalid ehca_qp_state=%x",ehca_qp_state); + return -EINVAL; + } +} + +/** @brief qp type + * used as index for req_attr and opt_attr of struct ehca_modqp_statetrans + */ +enum ehca_qp_type { + QPT_RC = 0, + QPT_UC = 1, + QPT_UD = 2, + QPT_SQP = 3, + QPT_MAX +}; + +/** @brief returns ehca qp type corresponding to ib qp type + */ +static inline enum ehca_qp_type ib2ehcaqptype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return QPT_SQP; + case IB_QPT_RC: + return QPT_RC; + case IB_QPT_UC: + return QPT_UC; + case IB_QPT_UD: + return QPT_UD; + default: + EDEB_ERR(4,"Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +static inline enum ib_qp_statetrans get_modqp_statetrans(int ib_fromstate, + int ib_tostate) +{ + int index = -EINVAL; + switch (ib_tostate) { + case IB_QPS_RESET: + index = IB_QPST_ANY2RESET; + break; + case IB_QPS_INIT: + if 
(ib_fromstate == IB_QPS_RESET) { + index = IB_QPST_RESET2INIT; + } else if (ib_fromstate == IB_QPS_INIT) { + index = IB_QPST_INIT2INIT; + } + break; + case IB_QPS_RTR: + if (ib_fromstate == IB_QPS_INIT) { + index = IB_QPST_INIT2RTR; + } + break; + case IB_QPS_RTS: + if (ib_fromstate == IB_QPS_RTR) { + index = IB_QPST_RTR2RTS; + } else if (ib_fromstate == IB_QPS_RTS) { + index = IB_QPST_RTS2RTS; + } else if (ib_fromstate == IB_QPS_SQD) { + index = IB_QPST_SQD2RTS; + } else if (ib_fromstate == IB_QPS_SQE) { + index = IB_QPST_SQE2RTS; + } + break; + case IB_QPS_SQD: + if (ib_fromstate == IB_QPS_RTS) { + index = IB_QPST_RTS2SQD; + } + break; + case IB_QPS_SQE: + /* not allowed via mod qp */ + break; + case IB_QPS_ERR: + index = IB_QPST_ANY2ERR; + break; + default: + return -EINVAL; + } + + return index; +} + +/** @brief ehca service types + */ +enum ehca_service_type { + ST_RC = 0, + ST_UC = 1, + ST_RD = 2, + ST_UD = 3 +}; + +/** @brief returns hcp service type corresponding to given ib qp type + * used by create_qp() + */ +static inline int ibqptype2servicetype(enum ib_qp_type ibqptype) +{ + switch (ibqptype) { + case IB_QPT_SMI: + case IB_QPT_GSI: + return ST_UD; + case IB_QPT_RC: + return ST_RC; + case IB_QPT_UC: + return ST_UC; + case IB_QPT_UD: + return ST_UD; + case IB_QPT_RAW_IPV6: + return -EINVAL; + case IB_QPT_RAW_ETY: + return -EINVAL; + default: + EDEB_ERR(4, "Invalid ibqptype=%x", ibqptype); + return -EINVAL; + } +} + +/* init_qp_queues - Initializes/constructs r/squeue and registers queue pages. 
+ * returns 0 if successful, + * -EXXXX if not + */ +static inline int init_qp_queues(struct ipz_adapter_handle ipz_hca_handle, + struct ehca_qp *my_qp, + int nr_sq_pages, + int nr_rq_pages, + int swqe_size, + int rwqe_size, + int nr_send_sges, int nr_receive_sges) +{ + int ret = -EINVAL; + int cnt = 0; + void *vpage = NULL; + u64 rpage = 0; + int ipz_rc = -1; + u64 hipz_rc = H_Parameter; + + ipz_rc = ipz_queue_ctor(&my_qp->ehca_qp_core.ipz_squeue, + nr_sq_pages, + EHCA_PAGESIZE, swqe_size, nr_send_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for squeue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + return ret; + } + + ipz_rc = ipz_queue_ctor(&my_qp->ehca_qp_core.ipz_rqueue, + nr_rq_pages, + EHCA_PAGESIZE, rwqe_size, nr_receive_sges); + if (!ipz_rc) { + EDEB_ERR(4, "Cannot allocate page for rqueue. ipz_rc=%x", + ipz_rc); + ret = -EBUSY; + goto init_qp_queues0; + } + /* register SQ pages */ + for (cnt = 0; cnt < nr_sq_pages; cnt++) { + vpage = ipz_QPageit_get_inc(&my_qp->ehca_qp_core.ipz_squeue); + if (!vpage) { + EDEB_ERR(4, "SQ ipz_QPageit_get_inc() " + "failed p_vpage= %p", vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + rpage = ehca_kv_to_g(vpage); + + hipz_rc = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 0, /*TODO*/ + rpage, 1, + my_qp->ehca_qp_core.galpas.kernel); + if (hipz_rc < H_Success) { + EDEB_ERR(4,"SQ hipz_qp_register_rpage() failed " + "rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + /* for sq no need to check hipz_rc against + e.g.
H_PAGE_REGISTERED */ + } + + ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_squeue); + + /* register RQ pages */ + for (cnt = 0; cnt < nr_rq_pages; cnt++) { + vpage = ipz_QPageit_get_inc(&my_qp->ehca_qp_core.ipz_rqueue); + if (!vpage) { + EDEB_ERR(4,"RQ ipz_QPageit_get_inc() " + "failed p_vpage = %p", vpage); + hipz_rc = H_Resource; + ret = -EINVAL; + goto init_qp_queues1; + } + + rpage = ehca_kv_to_g(vpage); + + hipz_rc = hipz_h_register_rpage_qp(ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, 0, 1, /*TODO*/ + rpage, 1, + my_qp->ehca_qp_core.galpas. + kernel); + if (hipz_rc < H_Success) { + EDEB_ERR(4, "RQ hipz_qp_register_rpage() failed " + "rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + if (cnt == (nr_rq_pages - 1)) { /* last page! */ + if (hipz_rc != H_Success) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "hipz_rc= %lx ", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + vpage = ipz_QPageit_get_inc(&my_qp->ehca_qp_core.ipz_rqueue); + if (vpage != NULL) { + EDEB_ERR(4,"ipz_QPageit_get_inc() " + "should not succeed vpage=%p", + vpage); + ret = -EINVAL; + goto init_qp_queues1; + } + } else { + if (hipz_rc != H_PAGE_REGISTERED) { + EDEB_ERR(4,"RQ hipz_qp_register_rpage() " + "hipz_rc= %lx ", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto init_qp_queues1; + } + } + } + + ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_rqueue); + + return 0; + + init_qp_queues1: + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_rqueue); + init_qp_queues0: + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_squeue); + return ret; +} + + +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + static int da_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 }; + int ret = -EINVAL; + int servicetype = 0; + int sigtype = 0; + + struct ehca_qp *my_qp = NULL; + struct ehca_pd *my_pd = NULL; + struct ehca_shca *shca = NULL; + struct ehca_cq *recv_ehca_cq = NULL; + struct ehca_cq 
*send_ehca_cq = NULL; + struct ib_ucontext *context = NULL; + u64 hipz_rc = H_Parameter; + int max_send_sge; + int max_recv_sge; + /* h_call's out parameters */ + u16 act_nr_send_wqes = 0, act_nr_receive_wqes = 0; + u8 act_nr_send_sges = 0, act_nr_receive_sges = 0; + u32 qp_nr = 0, + nr_sq_pages = 0, swqe_size = 0, rwqe_size = 0, nr_rq_pages = 0; + u8 daqp_completion; + u8 isdaqp; + EDEB_EN(7,"pd=%p init_attr=%p", pd, init_attr); + + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(init_attr); + + if (init_attr->sq_sig_type != IB_SIGNAL_REQ_WR && + init_attr->sq_sig_type != IB_SIGNAL_ALL_WR) { + EDEB_ERR(4, "init_attr->sq_sig_type=%x not allowed", + init_attr->sq_sig_type); + return ERR_PTR(-EINVAL); + } + + /* save daqp completion bits */ + daqp_completion = init_attr->qp_type & 0x60; + /* save daqp bit */ + isdaqp = (init_attr->qp_type & 0x80) ? 1 : 0; + init_attr->qp_type = init_attr->qp_type & 0x1F; + + if (init_attr->qp_type != IB_QPT_UD && + init_attr->qp_type != IB_QPT_SMI && + init_attr->qp_type != IB_QPT_GSI && + init_attr->qp_type != IB_QPT_UC && + init_attr->qp_type != IB_QPT_RC) { + EDEB_ERR(4,"wrong QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + if (init_attr->qp_type != IB_QPT_RC && isdaqp != 0) { + EDEB_ERR(4,"unsupported LL QP Type=%x",init_attr->qp_type); + return ERR_PTR(-EINVAL); + } + + if (pd->uobject && udata != NULL) { + context = pd->uobject->context; + } + + my_qp = ehca_qp_new(); + if (!my_qp) { + EDEB_ERR(4, "pd=%p not enough memory to alloc qp", pd); + return ERR_PTR(-ENOMEM); + } + + my_pd = container_of(pd, struct ehca_pd, ib_pd); + + shca = container_of(pd->device, struct ehca_shca, ib_device); + recv_ehca_cq = container_of(init_attr->recv_cq, struct ehca_cq, ib_cq); + send_ehca_cq = container_of(init_attr->send_cq, struct ehca_cq, ib_cq); + + my_qp->init_attr = *init_attr; + + do { + if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't reserve idr resources."); + goto create_qp_exit0; + } + +
down_write(&ehca_qp_idr_sem); + ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token); + up_write(&ehca_qp_idr_sem); + + } while (ret == -EAGAIN); + + if (ret) { + ret = -ENOMEM; + EDEB_ERR(4, "Can't allocate new idr entry."); + goto create_qp_exit0; + } + + servicetype = ibqptype2servicetype(init_attr->qp_type); + if (servicetype < 0) { + ret = -EINVAL; + EDEB_ERR(4, "Invalid qp_type=%x", init_attr->qp_type); + goto create_qp_exit0; + } + + if (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) { + sigtype = HCALL_SIGT_EVERY; + } else { + sigtype = HCALL_SIGT_BY_WQE; + } + + /* UD_AV CIRCUMVENTION */ + max_send_sge=init_attr->cap.max_send_sge; + max_recv_sge=init_attr->cap.max_recv_sge; + if (IB_QPT_UD == init_attr->qp_type || + IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + max_send_sge += 2; + max_recv_sge += 2; + } + + EDEB(7, "isdaqp=%x daqp_completion=%x", isdaqp, daqp_completion); + + hipz_rc = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, + &my_qp->pf, + servicetype, + isdaqp | daqp_completion, + sigtype, 0, /* no ud ad lkey ctrl */ + send_ehca_cq->ipz_cq_handle, + recv_ehca_cq->ipz_cq_handle, + shca->eq.ipz_eq_handle, + my_qp->token, + my_pd->fw_pd, + (u16) init_attr->cap.max_send_wr + 1, /* fixme(+1 ??) */ + (u16) init_attr->cap.max_recv_wr + 1, /* fixme(+1 ??) 
*/ + (u8) max_send_sge, + (u8) max_recv_sge, + 0, /* ignored if ud ad lkey ctrl is 0 */ + &my_qp->ipz_qp_handle, + &qp_nr, + &act_nr_send_wqes, + &act_nr_receive_wqes, + &act_nr_send_sges, + &act_nr_receive_sges, + &nr_sq_pages, + &nr_rq_pages, + &my_qp->ehca_qp_core.galpas); + if (hipz_rc != H_Success) { + EDEB_ERR(4, "h_alloc_resource_qp() failed rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto create_qp_exit1; + } + + /* store real qp_num as we got from ehca */ + my_qp->ehca_qp_core.real_qp_num = qp_nr; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + if (isdaqp == 0) { + swqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_receive_sges)]); + } else { /* for daqp we need to use msg size, not wqe size */ + swqe_size = da_msg_size[max_send_sge]; + rwqe_size = da_msg_size[max_recv_sge]; + act_nr_send_sges=1; + act_nr_receive_sges=1; + } + break; + case IB_QPT_UC: + swqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.nud.sg_list[(act_nr_receive_sges)]); + break; + + case IB_QPT_UD: + case IB_QPT_GSI: + case IB_QPT_SMI: + /* UD circumvention */ + act_nr_receive_sges -= 2; + act_nr_send_sges -= 2; + swqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[(act_nr_send_sges)]); + rwqe_size = offsetof(struct ehca_wqe, + u.ud_av.sg_list[(act_nr_receive_sges)]); + + if (IB_QPT_GSI == init_attr->qp_type || + IB_QPT_SMI == init_attr->qp_type) { + act_nr_send_wqes = init_attr->cap.max_send_wr; + act_nr_receive_wqes = init_attr->cap.max_recv_wr; + act_nr_send_sges = init_attr->cap.max_send_sge; + act_nr_receive_sges = init_attr->cap.max_recv_sge; + qp_nr = (init_attr->qp_type == IB_QPT_SMI) ? 
0 : 1; + } + + break; + + default: + break; + } + + /* initializes r/squeue and registers queue pages */ + ret = init_qp_queues(shca->ipz_hca_handle, my_qp, + nr_sq_pages, nr_rq_pages, + swqe_size, rwqe_size, + act_nr_send_sges, act_nr_receive_sges); + if (ret != 0) { + EDEB_ERR(4,"Couldn't initialize r/squeue and pages ret=%x", + ret); + goto create_qp_exit2; + } + + my_qp->ib_qp.pd = &my_pd->ib_pd; + my_qp->ib_qp.device = my_pd->ib_pd.device; + + my_qp->ib_qp.recv_cq = init_attr->recv_cq; + my_qp->ib_qp.send_cq = init_attr->send_cq; + + my_qp->ib_qp.qp_num = qp_nr; + my_qp->ib_qp.qp_type = init_attr->qp_type; + + my_qp->ehca_qp_core.qp_type = init_attr->qp_type; + my_qp->ib_qp.srq = init_attr->srq; + + my_qp->ib_qp.qp_context = init_attr->qp_context; + my_qp->ib_qp.event_handler = init_attr->event_handler; + + init_attr->cap.max_inline_data = 0; /* not supported? */ + init_attr->cap.max_recv_sge = act_nr_receive_sges; + init_attr->cap.max_recv_wr = act_nr_receive_wqes; + init_attr->cap.max_send_sge = act_nr_send_sges; + init_attr->cap.max_send_wr = act_nr_send_wqes; + + /* TODO : define_apq0() not supported yet */ + if (init_attr->qp_type == IB_QPT_GSI) { + if ((hipz_rc = ehca_define_sqp(shca, my_qp, init_attr))) { + EDEB_ERR(4, "ehca_define_sqp() failed rc=%lx", hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto create_qp_exit3; + } + } + + if (init_attr->send_cq != NULL) { + struct ehca_cq *cq = container_of(init_attr->send_cq, + struct ehca_cq, ib_cq); + ret = ehca_cq_assign_qp(cq, my_qp); + if (ret != 0) { + EDEB_ERR(4, "Couldn't assign qp to send_cq ret=%x", ret); + goto create_qp_exit3; + } + my_qp->send_cq = cq; + } + + /* copy queues, galpa data to user space */ + if (context != NULL && udata != NULL) { + struct ehca_create_qp_resp resp; + struct vm_area_struct * vma; + resp.qp_num = qp_nr; + resp.token = my_qp->token; + resp.ehca_qp_core = my_qp->ehca_qp_core; + + ehca_mmap_nopage(((u64) (my_qp->token) << 32) | 0x22000000, + 
my_qp->ehca_qp_core.ipz_rqueue.queue_length, + ((void**)&resp.ehca_qp_core.ipz_rqueue.queue), + &vma); + my_qp->uspace_rqueue = (u64)resp.ehca_qp_core.ipz_rqueue.queue; + ehca_mmap_nopage(((u64) (my_qp->token) << 32) | 0x23000000, + my_qp->ehca_qp_core.ipz_squeue.queue_length, + ((void**)&resp.ehca_qp_core.ipz_squeue.queue), + &vma); + my_qp->uspace_squeue = (u64)resp.ehca_qp_core.ipz_squeue.queue; + ehca_mmap_register(my_qp->ehca_qp_core.galpas.user.fw_handle, + ((void**)&resp.ehca_qp_core.galpas.kernel.fw_handle), + &vma); + my_qp->uspace_fwh = (u64)resp.ehca_qp_core.galpas.kernel.fw_handle; + + if (ib_copy_to_udata(udata, &resp, sizeof resp)) { + EDEB_ERR(4, "Copy to udata failed"); + ret = -EINVAL; + goto create_qp_exit3; + } + } + + EDEB_EX(7, "ehca_qp=%p qp_num=%x, token=%x", + my_qp, qp_nr, my_qp->token); + return (&my_qp->ib_qp); + + create_qp_exit3: + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_rqueue); + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_squeue); + + create_qp_exit2: + hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + + create_qp_exit1: + down_write(&ehca_qp_idr_sem); + idr_remove(&ehca_qp_idr, my_qp->token); + up_write(&ehca_qp_idr_sem); + + create_qp_exit0: + ehca_qp_delete(my_qp); + EDEB_EX(4, "failed ret=%x", ret); + return ERR_PTR(ret); + +} + +/** called by internal_modify_qp() at trans sqe -> rts: + * set purge bit of bad wqe and subsequent wqes to avoid reentering sqe + * @return total number of bad wqes in bad_wqe_cnt + */ +static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, + int *bad_wqe_cnt) +{ + int ret = 0; + u64 hipz_rc = H_Success; + struct ipz_queue *squeue = NULL; + void *bad_send_wqe_p = NULL; + void *bad_send_wqe_v = NULL; + void *squeue_start_p = NULL; + void *squeue_end_p = NULL; + void *squeue_start_v = NULL; + void *squeue_end_v = NULL; + struct ehca_wqe *wqe = NULL; + int qp_num = my_qp->ib_qp.qp_num; + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ", my_qp, qp_num); + + /* get send wqe pointer */ + hipz_rc = 
hipz_h_disable_and_get_wqe(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, &my_qp->pf, + &bad_send_wqe_p, NULL, 2); + if (hipz_rc != H_Success) { + EDEB_ERR(4, "hipz_h_disable_and_get_wqe() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, qp_num, hipz_rc); + ret = ehca2ib_return_code(hipz_rc); + goto prepare_sqe_rts_exit1; + } + bad_send_wqe_p = (void*)((u64)bad_send_wqe_p & (~(1L<<63))); + EDEB(7, "qp_num=%x bad_send_wqe_p=%p", qp_num, bad_send_wqe_p); + /* convert wqe pointer to vadr */ + bad_send_wqe_v = abs_to_virt((u64)bad_send_wqe_p); + EDEB_DMP(6, bad_send_wqe_v, 32, "qp_num=%x bad_wqe", qp_num); + + squeue = &my_qp->ehca_qp_core.ipz_squeue; + squeue_start_p = (void*)ehca_kv_to_g(squeue->queue); + squeue_end_p = squeue_start_p+squeue->queue_length; + squeue_start_v = abs_to_virt((u64)squeue_start_p); + squeue_end_v = abs_to_virt((u64)squeue_end_p); + EDEB(6, "qp_num=%x squeue_start_v=%p squeue_end_v=%p", + qp_num, squeue_start_v, squeue_end_v); + + /* loop sets wqe's purge bit */ + wqe = (struct ehca_wqe*)bad_send_wqe_v; + *bad_wqe_cnt = 0; + while (wqe->optype != 0xff && wqe->wqef != 0xff) { + EDEB_DMP(6, wqe, 32, "qp_num=%x wqe", qp_num); + wqe->nr_of_data_seg = 0; /* suppress data access */ + wqe->wqef = WQEF_PURGE; /* WQE to be purged */ + wqe = (struct ehca_wqe*)((u8*)wqe+squeue->qe_size); + *bad_wqe_cnt = (*bad_wqe_cnt)+1; + if ((void*)wqe >= squeue_end_v) { + wqe = squeue_start_v; + } + } /* eof while wqe */ + /* bad wqe will be reprocessed and ignored when pol_cq() is called, + i.e. nr of wqes with flush error status is one less */ + EDEB(6, "qp_num=%x flusherr_wqe_cnt=%x", qp_num, (*bad_wqe_cnt)-1); + wqe->wqef = 0; + + prepare_sqe_rts_exit1: + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x", my_qp, qp_num, ret); + return ret; +} + +/** @brief internal modify qp with circumvention to handle aqp0 properly + * smi_reset2init indicates if this is an internal reset-to-init-call for + * smi. This flag must always be zero if called from ehca_modify_qp()! 
+ * This internal func was introduced to avoid recursion of ehca_modify_qp()! + */ +static int internal_modify_qp(struct ib_qp *ibqp, + struct ib_qp_attr *attr, + int attr_mask, int smi_reset2init) +{ + enum ib_qp_state qp_cur_state = 0, qp_new_state = 0; + int cnt = 0, qp_attr_idx = 0, retcode = 0; + + enum ib_qp_statetrans statetrans; + struct hcp_modify_qp_control_block *mqpcb = NULL; + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + u64 update_mask = 0; + u64 hipz_rc = H_Success; + int bad_wqe_cnt = 0; + int squeue_locked = 0; + unsigned long spl_flags = 0; + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, + attr->qp_state, attr_mask); + + /* do query_qp to obtain current attr values */ + mqpcb = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (mqpcb == NULL) { + retcode = -ENOMEM; + EDEB_ERR(4, "Could not get zeroed page for mqpcb " + "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); + goto modify_qp_exit0; + } + memset(mqpcb, 0, PAGE_SIZE); + + hipz_rc = hipz_h_query_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + mqpcb, my_qp->ehca_qp_core.galpas.kernel); + if (hipz_rc != H_Success) { + EDEB_ERR(4, "hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, ibqp->qp_num, hipz_rc); + retcode = ehca2ib_return_code(hipz_rc); + goto modify_qp_exit1; + } + EDEB(7, "ehca_qp=%p qp_num=%x ehca_qp_state=%x", + my_qp, ibqp->qp_num, mqpcb->qp_state); + + qp_cur_state = ehca2ib_qp_state(mqpcb->qp_state); + + if (qp_cur_state == -EINVAL) { /* invalid qp state */ + retcode = -EINVAL; + EDEB_ERR(4, "Invalid current ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + mqpcb->qp_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + /* circumvention to set aqp0 initial state to init + as expected by IB spec */ + if (smi_reset2init == 0 && + 
ibqp->qp_type == IB_QPT_SMI && + qp_cur_state == IB_QPS_RESET && + (attr_mask & IB_QP_STATE) + && attr->qp_state == IB_QPS_INIT) { /* RESET -> INIT */ + struct ib_qp_attr smiqp_attr = { + .qp_state = IB_QPS_INIT, + .port_num = my_qp->init_attr.port_num, + .pkey_index = 0, + .qkey = 0 + }; + int smiqp_attr_mask = IB_QP_STATE | IB_QP_PORT | + IB_QP_PKEY_INDEX | IB_QP_QKEY; + int smirc = internal_modify_qp( + ibqp, &smiqp_attr, smiqp_attr_mask, 1); + if (smirc != 0) { + EDEB_ERR(4, "SMI RESET -> INIT failed. " + "ehca_modify_qp() rc=%x", smirc); + retcode = H_Parameter; + goto modify_qp_exit1; + } + qp_cur_state = IB_QPS_INIT; + EDEB(7, "SMI RESET -> INIT succeeded"); + } + /* is transmitted current state equal to "real" current state */ + if (attr_mask & IB_QP_CUR_STATE) { + if (qp_cur_state != attr->cur_qp_state) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid IB_QP_CUR_STATE " + "attr->curr_qp_state=%x <> " + "actual cur_qp_state=%x. " + "ehca_qp=%p qp_num=%x", + attr->cur_qp_state, qp_cur_state, + my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + } + + EDEB(7, "ehca_qp=%p qp_num=%x current qp_state=%x " + "new qp_state=%x attribute_mask=%x", + my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask); + + qp_new_state = attr_mask & IB_QP_STATE ? 
attr->qp_state : qp_cur_state; + if (!smi_reset2init && + !ib_modify_qp_is_ok(qp_cur_state, qp_new_state, ibqp->qp_type, + attr_mask)) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid qp transition new_state=%x cur_state=%x " + "ehca_qp=%p qp_num=%x attr_mask=%x", + qp_new_state, qp_cur_state, my_qp, ibqp->qp_num, + attr_mask); + goto modify_qp_exit1; + } + + if ((mqpcb->qp_state = ib2ehca_qp_state(qp_new_state))) { + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); + } else { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid new qp state=%x " + "ehca_qp=%p qp_num=%x", + qp_new_state, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + /* retrieve state transition struct to get req and opt attrs */ + statetrans = get_modqp_statetrans(qp_cur_state, qp_new_state); + if (statetrans < 0) { + retcode = -EINVAL; + EDEB_ERR(4, " qp_cur_state=%x " + "new_qp_state=%x State_xsition=%x " + "ehca_qp=%p qp_num=%x", + qp_cur_state, qp_new_state, + statetrans, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + qp_attr_idx = ib2ehcaqptype(ibqp->qp_type); + + if (qp_attr_idx < 0) { + retcode = qp_attr_idx; + EDEB_ERR(4, "Invalid QP type=%x ehca_qp=%p qp_num=%x", + ibqp->qp_type, my_qp, ibqp->qp_num); + goto modify_qp_exit1; + } + + EDEB(7, "ehca_qp=%p qp_num=%x qp_state_xsit=%x", + my_qp, ibqp->qp_num, statetrans); + + /* sqe -> rts: set purge bit of bad wqe before actual trans */ + if ((my_qp->ehca_qp_core.qp_type == IB_QPT_UD + || my_qp->ehca_qp_core.qp_type == IB_QPT_GSI + || my_qp->ehca_qp_core.qp_type == IB_QPT_SMI) + && statetrans == IB_QPST_SQE2RTS) { + /* mark next free wqe if kernel */ + if (my_qp->uspace_squeue == 0) { + struct ehca_wqe *wqe = NULL; + /* lock send queue */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + squeue_locked = 1; + /* mark next free wqe */ + wqe=(struct ehca_wqe*) + my_qp->ehca_qp_core.ipz_squeue.current_q_addr; + wqe->optype = wqe->wqef = 0xff; + EDEB(7, "qp_num=%x next_free_wqe=%p", + ibqp->qp_num, wqe); + } + retcode = 
prepare_sqe_rts(my_qp, shca, &bad_wqe_cnt); + if (retcode != 0) { + EDEB_ERR(4, "prepare_sqe_rts() failed " + "ehca_qp=%p qp_num=%x ret=%x", + my_qp, ibqp->qp_num, retcode); + goto modify_qp_exit2; + } + } + + /* enable RDMA_Atomic_Control if reset->init und reliable con + this is necessary since gen2 does not provide that flag, + but pHyp requires it */ + if (statetrans == IB_QPST_RESET2INIT && + (ibqp->qp_type == IB_QPT_RC || ibqp->qp_type == IB_QPT_UC)) { + mqpcb->rdma_atomic_ctrl = 3; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RDMA_ATOMIC_CTRL, 1); + } + /* circ. pHyp requires #RDMA/Atomic Responder Resources for UC INIT -> RTR */ + if (statetrans == IB_QPST_INIT2RTR && + (ibqp->qp_type == IB_QPT_UC) && + !(attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)) { + mqpcb->rdma_nr_atomic_resp_res = 1; /* default to 1 */ + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + mqpcb->prim_p_key_idx = attr->pkey_index; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_P_KEY_IDX, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PKEY_INDEX update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_PORT) { + if (attr->port_num < 1 || attr->port_num > shca->num_ports) { + retcode = -EINVAL; + EDEB_ERR(4, "Invalid port=%x. 
" + "ehca_qp=%p qp_num=%x num_ports=%x", + attr->port_num, my_qp, ibqp->qp_num, + shca->num_ports); + goto modify_qp_exit2; + } + mqpcb->prim_phys_port = attr->port_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_PHYS_PORT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PORT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_QKEY) { + mqpcb->qkey = attr->qkey; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_QKEY, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_QKEY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_AV) { + mqpcb->dlid = attr->ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID, 1); + mqpcb->source_path_bits = attr->ah_attr.src_path_bits; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS, 1); + mqpcb->service_level = attr->ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); + mqpcb->max_static_rate = attr->ah_attr.static_rate; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); + + /* only if GRH is TRUE we might consider SOURCE_GID_IDX and DEST_GID + * otherwise phype will return H_ATTR_PARM!!! 
+ */ + if (attr->ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + mqpcb->source_gid_idx = attr->ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); + + for (cnt = 0; cnt < 16; cnt++) { + mqpcb->dest_gid.byte[cnt] = + attr->ah_attr.grh.dgid.raw[cnt]; + } + + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_GID, 1); + mqpcb->flow_label = attr->ah_attr.grh.flow_label; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL, 1); + mqpcb->hop_limit = attr->ah_attr.grh.hop_limit; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT, 1); + mqpcb->traffic_class = attr->ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_AV update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MTU) { + mqpcb->path_mtu = attr->path_mtu; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PATH_MTU, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_PATH_MTU update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_TIMEOUT) { + mqpcb->timeout = attr->timeout; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_TIMEOUT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_TIMEOUT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RETRY_CNT) { + mqpcb->retry_count = attr->retry_cnt; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RETRY_CNT update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RNR_RETRY) { + mqpcb->rnr_retry_count = attr->rnr_retry; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RNR_RETRY_COUNT, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_RNR_RETRY update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_RQ_PSN) { + mqpcb->receive_psn = attr->rq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RECEIVE_PSN, 1); + EDEB(7, "ehca_qp=%p 
qp_num=%x IB_QP_RQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { + /* @TODO CHECK THIS with our spec */ + mqpcb->rdma_nr_atomic_resp_res = attr->max_dest_rd_atomic; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_DEST_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { + /* @TODO CHECK THIS with our spec */ + mqpcb->rdma_atomic_outst_dest_qp = attr->max_rd_atomic; + update_mask |= + EHCA_BMASK_SET + (MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP, 1); + EDEB(7, "ehca_qp=%p qp_num=%x IB_QP_MAX_QP_RD_ATOMIC " + "update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + } + if (attr_mask & IB_QP_ALT_PATH) { + mqpcb->dlid_al = attr->alt_ah_attr.dlid; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID_AL, 1); + mqpcb->source_path_bits_al = attr->alt_ah_attr.src_path_bits; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS_AL, 1); + mqpcb->service_level_al = attr->alt_ah_attr.sl; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL_AL, 1); + mqpcb->max_static_rate_al = attr->alt_ah_attr.static_rate; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE_AL, 1); + + /* only if GRH is TRUE we might consider SOURCE_GID_IDX and DEST_GID + * otherwise phype will return H_ATTR_PARM!!! 
+ */ + if (attr->alt_ah_attr.ah_flags == IB_AH_GRH) { + mqpcb->send_grh_flag_al = 1 << 31; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG_AL, 1); + mqpcb->source_gid_idx_al = + attr->alt_ah_attr.grh.sgid_index; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX_AL, 1); + + for (cnt = 0; cnt < 16; cnt++) { + mqpcb->dest_gid_al.byte[cnt] = + attr->alt_ah_attr.grh.dgid.raw[cnt]; + } + + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_DEST_GID_AL, 1); + mqpcb->flow_label_al = attr->alt_ah_attr.grh.flow_label; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_FLOW_LABEL_AL, 1); + mqpcb->hop_limit_al = attr->alt_ah_attr.grh.hop_limit; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_HOP_LIMIT_AL, 1); + mqpcb->traffic_class_al = + attr->alt_ah_attr.grh.traffic_class; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_TRAFFIC_CLASS_AL, 1); + } + + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_ALT_PATH update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + mqpcb->min_rnr_nak_timer_field = attr->min_rnr_timer; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_MIN_RNR_TIMER update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_SQ_PSN) { + mqpcb->send_psn = attr->sq_psn; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_PSN, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_SQ_PSN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_DEST_QPN) { + mqpcb->dest_qp_nr = attr->dest_qp_num; + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_QP_NR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_DEST_QPN update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_PATH_MIG_STATE) { + mqpcb->path_migration_state = attr->path_mig_state; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_PATH_MIGRATION_STATE, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_PATH_MIG_STATE update_mask=%lx", my_qp, + 
ibqp->qp_num, update_mask); + } + + if (attr_mask & IB_QP_CAP) { + mqpcb->max_nr_outst_send_wr = attr->cap.max_send_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_SEND_WR, 1); + mqpcb->max_nr_outst_recv_wr = attr->cap.max_recv_wr+1; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_RECV_WR, 1); + EDEB(7, "ehca_qp=%p qp_num=%x " + "IB_QP_CAP update_mask=%lx", + my_qp, ibqp->qp_num, update_mask); + /* TODO no support for max_send/recv_sge??? */ + } + + EDEB_DMP(7, mqpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + hipz_rc = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->ehca_qp_core.galpas.kernel); + + if (hipz_rc != H_Success) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4, "hipz_h_modify_qp() failed rc=%lx " + "ehca_qp=%p qp_num=%x", + hipz_rc, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + + if ((my_qp->ehca_qp_core.qp_type == IB_QPT_UD + || my_qp->ehca_qp_core.qp_type == IB_QPT_GSI + || my_qp->ehca_qp_core.qp_type == IB_QPT_SMI) + && statetrans == IB_QPST_SQE2RTS) { + /* doorbell to reprocessing wqes */ + iosync(); /* serialize GAL register access */ + hipz_update_SQA(&my_qp->ehca_qp_core, bad_wqe_cnt-1); + EDEB(6, "doorbell for %x wqes", bad_wqe_cnt); + } + + if (statetrans == IB_QPST_RESET2INIT || + statetrans == IB_QPST_INIT2INIT) { + mqpcb->qp_enable = TRUE; + mqpcb->qp_state = EHCA_QPS_INIT; + update_mask = 0; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_ENABLE, 1); + + EDEB(7, "ehca_qp=%p qp_num=%x " + "RESET_2_INIT needs an additional enable " + "-> update_mask=%lx", my_qp, ibqp->qp_num, update_mask); + + hipz_rc = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, + my_qp->ehca_qp_core.galpas.kernel); + + if (hipz_rc != H_Success) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4, "ENABLE in context of " + "RESET_2_INIT failed! 
" + "Maybe you didn't get a LID" + "hipz_rc=%lx ehca_qp=%p qp_num=%x", + hipz_rc, my_qp, ibqp->qp_num); + goto modify_qp_exit2; + } + } + + if (statetrans == IB_QPST_ANY2RESET) { + ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_rqueue); + ipz_QEit_reset(&my_qp->ehca_qp_core.ipz_squeue); + } + + if (attr_mask & IB_QP_QKEY) { + my_qp->ehca_qp_core.qkey = attr->qkey; + } + + modify_qp_exit2: + if (squeue_locked) { /* this means: sqe -> rts */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + my_qp->sqerr_purgeflag = 1; + } + + modify_qp_exit1: + kfree(mqpcb); + + modify_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x retcode=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, retcode); + return retcode; +} + +int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + int ret = 0; + struct ehca_qp *my_qp = NULL; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(attr); + EHCA_CHECK_ADR(ibqp->device); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x ibqp_type=%x attr_mask=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, attr_mask); + + ret = internal_modify_qp(ibqp, attr, attr_mask, 0); + + EDEB_EX(7, "ehca_qp=%p qp_num=%x ibqp_type=%x ret=%x", + my_qp, ibqp->qp_num, ibqp->qp_type, ret); + return ret; +} + +int ehca_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct hcp_modify_qp_control_block *qpcb = NULL; + + struct ipz_adapter_handle adapter_handle; + int cnt = 0, retcode = 0; + u64 hipz_rc = H_Success; + + EHCA_CHECK_ADR(qp); + EHCA_CHECK_ADR(qp_attr); + EHCA_CHECK_DEVICE(qp->device); + + my_qp = container_of(qp, struct ehca_qp, ib_qp); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x " + "qp_attr=%p qp_attr_mask=%x qp_init_attr=%p", + my_qp, qp->qp_num, qp_attr, qp_attr_mask, qp_init_attr); + + shca = container_of(qp->device, struct ehca_shca, ib_device); + adapter_handle = 
shca->ipz_hca_handle; + + if (qp_attr_mask & QP_ATTR_QUERY_NOT_SUPPORTED) { + retcode = -EINVAL; + EDEB_ERR(4,"Invalid attribute mask " + "ehca_qp=%p qp_num=%x qp_attr_mask=%x ", + my_qp, qp->qp_num, qp_attr_mask); + goto query_qp_exit0; + } + + qpcb = kmalloc(EHCA_PAGESIZE, GFP_KERNEL ); + + if (qpcb == NULL) { + retcode = -ENOMEM; + EDEB_ERR(4,"Out of memory for qpcb " + "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + goto query_qp_exit0; + } + memset(qpcb, 0, sizeof(*qpcb)); + + hipz_rc = hipz_h_query_qp(adapter_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + qpcb, my_qp->ehca_qp_core.galpas.kernel); + + if (hipz_rc != H_Success) { + retcode = ehca2ib_return_code(hipz_rc); + EDEB_ERR(4,"hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x hipz_rc=%lx", + my_qp, qp->qp_num, hipz_rc); + goto query_qp_exit1; + } + + qp_attr->cur_qp_state = ehca2ib_qp_state(qpcb->qp_state); + qp_attr->qp_state = qp_attr->cur_qp_state; + if (qp_attr->cur_qp_state == -EINVAL) { + retcode = -EINVAL; + EDEB_ERR(4,"Got invalid ehca_qp_state=%x " + "ehca_qp=%p qp_num=%x", + qpcb->qp_state, my_qp, qp->qp_num); + goto query_qp_exit1; + } + + if (qp_attr->qp_state == IB_QPS_SQD) { + qp_attr->sq_draining = TRUE; + } + + qp_attr->qkey = qpcb->qkey; + qp_attr->path_mtu = qpcb->path_mtu; + qp_attr->path_mig_state = qpcb->path_migration_state; + qp_attr->rq_psn = qpcb->receive_psn; + qp_attr->sq_psn = qpcb->send_psn; + qp_attr->min_rnr_timer = qpcb->min_rnr_nak_timer_field; + qp_attr->cap.max_send_wr = qpcb->max_nr_outst_send_wr-1; + qp_attr->cap.max_recv_wr = qpcb->max_nr_outst_recv_wr-1; + /* UD_AV CIRCUMVENTION */ + if (my_qp->ehca_qp_core.qp_type == IB_QPT_UD) { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe - 2; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe - 2; + } else { + qp_attr->cap.max_send_sge = + qpcb->actual_nr_sges_in_sq_wqe; + qp_attr->cap.max_recv_sge = + qpcb->actual_nr_sges_in_rq_wqe; + } + + qp_attr->cap.max_inline_data = 
my_qp->sq_max_inline_data_size; + qp_attr->dest_qp_num = qpcb->dest_qp_nr; + + qp_attr->pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->prim_p_key_idx); + + qp_attr->port_num = + EHCA_BMASK_GET(MQPCB_PRIM_PHYS_PORT, qpcb->prim_phys_port); + + qp_attr->timeout = qpcb->timeout; + qp_attr->retry_cnt = qpcb->retry_count; + qp_attr->rnr_retry = qpcb->rnr_retry_count; + + qp_attr->alt_pkey_index = + EHCA_BMASK_GET(MQPCB_PRIM_P_KEY_IDX, qpcb->alt_p_key_idx); + + qp_attr->alt_port_num = qpcb->alt_phys_port; + qp_attr->alt_timeout = qpcb->timeout_al; + + /* primary av */ + qp_attr->ah_attr.sl = qpcb->service_level; + + if (qpcb->send_grh_flag) { + qp_attr->ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->ah_attr.static_rate = qpcb->max_static_rate; + qp_attr->ah_attr.dlid = qpcb->dlid; + qp_attr->ah_attr.src_path_bits = qpcb->source_path_bits; + qp_attr->ah_attr.port_num = qp_attr->port_num; + + /* primary GRH */ + qp_attr->ah_attr.grh.traffic_class = qpcb->traffic_class; + qp_attr->ah_attr.grh.hop_limit = qpcb->hop_limit; + qp_attr->ah_attr.grh.sgid_index = qpcb->source_gid_idx; + qp_attr->ah_attr.grh.flow_label = qpcb->flow_label; + + for (cnt = 0; cnt < 16; cnt++) { + qp_attr->ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid.byte[cnt]; + } + + /* alternate AV */ + qp_attr->alt_ah_attr.sl = qpcb->service_level_al; + if (qpcb->send_grh_flag_al) { + qp_attr->alt_ah_attr.ah_flags = IB_AH_GRH; + } + + qp_attr->alt_ah_attr.static_rate = qpcb->max_static_rate_al; + qp_attr->alt_ah_attr.dlid = qpcb->dlid_al; + qp_attr->alt_ah_attr.src_path_bits = qpcb->source_path_bits_al; + + /* alternate GRH */ + qp_attr->alt_ah_attr.grh.traffic_class = qpcb->traffic_class_al; + qp_attr->alt_ah_attr.grh.hop_limit = qpcb->hop_limit_al; + qp_attr->alt_ah_attr.grh.sgid_index = qpcb->source_gid_idx_al; + qp_attr->alt_ah_attr.grh.flow_label = qpcb->flow_label_al; + + for (cnt = 0; cnt < 16; cnt++) { + qp_attr->alt_ah_attr.grh.dgid.raw[cnt] = + qpcb->dest_gid_al.byte[cnt]; + } + + /* return init 
attributes given in ehca_create_qp */ + if (qp_init_attr != NULL) { + *qp_init_attr = my_qp->init_attr; + } + + EDEB(7, "ehca_qp=%p qp_number=%x dest_qp_number=%x " + "dlid=%x path_mtu=%x dest_gid=%lx_%lx " + "service_level=%x qp_state=%x", + my_qp, qpcb->qp_number, qpcb->dest_qp_nr, + qpcb->dlid, qpcb->path_mtu, + qpcb->dest_gid.dw[0], qpcb->dest_gid.dw[1], + qpcb->service_level, qpcb->qp_state); + + EDEB_DMP(7, qpcb, 4*70, "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); + + query_qp_exit1: + kfree(qpcb); + + query_qp_exit0: + EDEB_EX(7, "ehca_qp=%p qp_num=%x retcode=%x", + my_qp, qp->qp_num, retcode); + return retcode; +} + +int ehca_destroy_qp(struct ib_qp *ibqp) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + struct ehca_pfqp *qp_pf = NULL; + u32 qp_num = 0; + int retcode = 0; + u64 hipz_ret = H_Success; + u8 port_num = 0; + enum ib_qp_type qp_type; + + EHCA_CHECK_ADR(ibqp); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + qp_num = ibqp->qp_num; + qp_pf = &my_qp->pf; + + shca = container_of(ibqp->device, struct ehca_shca, ib_device); + + EDEB_EN(7, "ehca_qp=%p qp_num=%x", my_qp, ibqp->qp_num); + + if (my_qp->send_cq != NULL) { + retcode = ehca_cq_unassign_qp(my_qp->send_cq, + my_qp->ehca_qp_core.real_qp_num); + if (retcode != 0) { + EDEB_ERR(4, "Couldn't unassign qp from send_cq " + "ret=%x qp_num=%x cq_num=%x", + retcode, my_qp->ib_qp.qp_num, + my_qp->send_cq->cq_number); + goto destroy_qp_exit0; + } + } + + down_write(&ehca_qp_idr_sem); + idr_remove(&ehca_qp_idr, my_qp->token); + up_write(&ehca_qp_idr_sem); + + /* un-mmap if vma alloc */ + if (my_qp->uspace_rqueue != 0) { + struct ehca_qp_core *qp_core = &my_qp->ehca_qp_core; + retcode = ehca_munmap(my_qp->uspace_rqueue, + qp_core->ipz_rqueue.queue_length); + retcode = ehca_munmap(my_qp->uspace_squeue, + qp_core->ipz_squeue.queue_length); + retcode = ehca_munmap(my_qp->uspace_fwh, 4096); + } + + hipz_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (hipz_ret != H_Success) { 
+ EDEB_ERR(4, "hipz_h_destroy_qp() failed " + "rc=%lx ehca_qp=%p qp_num=%x", + hipz_ret, qp_pf, qp_num); + goto destroy_qp_exit0; + } + + port_num = my_qp->init_attr.port_num; + qp_type = my_qp->init_attr.qp_type; + + /* TODO: later with IB_QPT_SMI */ + if (qp_type == IB_QPT_GSI) { + struct ib_event event; + + EDEB(4, "EHCA port %x is inactive.", port_num); + event.device = &shca->ib_device; + event.event = IB_EVENT_PORT_ERR; + event.element.port_num = port_num; + shca->sport[port_num - 1].port_state = IB_PORT_DOWN; + ib_dispatch_event(&event); + } + + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_rqueue); + ipz_queue_dtor(&my_qp->ehca_qp_core.ipz_squeue); + ehca_qp_delete(my_qp); + + destroy_qp_exit0: + retcode = ehca2ib_return_code(hipz_ret); + EDEB_EX(7,"ret=%x", retcode); + return retcode; +} + +/* eof ehca_qp.c */ From rolandd at cisco.com Fri Feb 17 16:57:50 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:50 -0800 Subject: [openib-general] [PATCH 17/22] Special QP functions In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005750.13620.62709.stgit@localhost.localdomain> From: Roland Dreier The wait for the port to become active when creating QP 1 seems bizarre. Why can't we just create QP 1 before the port is active? What is the issue with creating QP 0? Without QP 0, it's impossible to run a subnet manager on top of ehca. 
--- drivers/infiniband/hw/ehca/ehca_sqp.c | 135 +++++++++++++++++++++++++++++++++ 1 files changed, 135 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c new file mode 100644 index 0000000..bbad4cb --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -0,0 +1,135 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * SQP functions + * + * Authors: Khadija Souissi + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_sqp.c,v 1.35 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#define DEB_PREFIX "e_qp" + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" + +#include +#include + +extern int ehca_create_aqp1(struct ehca_shca *shca, struct ehca_sport *sport); +extern int ehca_destroy_aqp1(struct ehca_sport *sport); + +extern int ehca_port_act_time; + +/** + * ehca_define_aqp0 - TODO + * + * @ehca_qp: : TODO adapter_handle, ipz_qp_handle, galpas.kernel + * @qp_init_attr : TODO for port number + */ +u64 ehca_define_sqp(struct ehca_shca *shca, + struct ehca_qp *ehca_qp, + struct ib_qp_init_attr *qp_init_attr) +{ + + u32 pma_qp_nr = 0; + u32 bma_qp_nr = 0; + u64 ret = H_Success; + u8 port = qp_init_attr->port_num; + int counter = 0; + + EDEB_EN(7, "port=%x qp_type=%x", + port, qp_init_attr->qp_type); + + shca->sport[port - 1].port_state = IB_PORT_DOWN; + + switch (qp_init_attr->qp_type) { + case IB_QPT_SMI: + /* TODO: function not supported yet */ + /* + ret = hipz_h_define_aqp0(shca->ipz_hca_handle, + ehca_qp->ipz_qp_handle, + ehca_qp->galpas.kernel, + (u32)qp_init_attr->port_num); + */ + break; + case IB_QPT_GSI: + ret = hipz_h_define_aqp1(shca->ipz_hca_handle, + ehca_qp->ipz_qp_handle, + ehca_qp->ehca_qp_core.galpas.kernel, + (u32) qp_init_attr->port_num, + &pma_qp_nr, &bma_qp_nr); + + if (ret != H_Success) { + EDEB_ERR(4, "Can't define AQP1 for port %x. rc=%lx", + port, ret); + goto ehca_define_aqp1; + } + break; + default: + ret = H_Parameter; + goto ehca_define_aqp1; + } + +#ifndef EHCA_USERDRIVER + while ((shca->sport[port - 1].port_state != IB_PORT_ACTIVE) && + (counter < ehca_port_act_time)) { + EDEB(6, "... 
wait until port %x is active", + port); + msleep_interruptible(1000); + counter++; + } + + if (counter == ehca_port_act_time) { + EDEB_ERR(4, "Port %x is not active.", port); + ret = H_Hardware; + } +#else + if (shca->sport[port - 1].port_state != IB_PORT_ACTIVE) { + sleep(20); + } +#endif + + ehca_define_aqp1: + EDEB_EX(7, "ret=%lx", ret); + + return ret; +} From rolandd at cisco.com Fri Feb 17 16:57:52 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:52 -0800 Subject: [openib-general] [PATCH 18/22] ehca address vectors, multicast groups, protection domains In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005752.13620.3255.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_av.c | 258 +++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_mcast.c | 194 +++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_pd.c | 100 ++++++++++++ 3 files changed, 552 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c new file mode 100644 index 0000000..f5382c2 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_av.c @@ -0,0 +1,258 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * adress vector functions + * + * Authors: Reinhard Ernst + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_av.c,v 1.28 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#define DEB_PREFIX "ehav" + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" + +struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + extern int ehca_static_rate; + int retcode = 0; + struct ehca_av *av = NULL; + + EHCA_CHECK_PD_P(pd); + EHCA_CHECK_ADR_P(ah_attr); + + EDEB_EN(7,"pd=%p ah_attr=%p", pd, ah_attr); + + av = ehca_av_new(); + if (!av) { + EDEB_ERR(4,"Out of memory pd=%p ah_attr=%p", pd, ah_attr); + retcode = -ENOMEM; + goto create_ah_exit0; + } + + av->av.sl = ah_attr->sl; + av->av.dlid = ntohs(ah_attr->dlid); + av->av.slid_path_bits = ah_attr->src_path_bits; + + if (ehca_static_rate < 0) { + av->av.ipd = ah_attr->static_rate; + } else { + av->av.ipd = ehca_static_rate; + } + + av->av.lnh = ah_attr->ah_flags; + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_IPVERSION_MASK, 6); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + av->av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1B); + /* IB transport */ + av->av.grh.word_0 = be64_to_cpu(av->av.grh.word_0); + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(pd->device, ah_attr->port_num, + &port_attr); + if (rc != 0) { /* invalid port number */ + retcode = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(pd->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc != 0) { + retcode = -EINVAL; + EDEB_ERR(4, "Failed to retrieve sgid " + 
"ehca_query_gid() returned %x " + "pd=%p ah_attr=%p", rc, pd, ah_attr); + goto create_ah_exit1; + } + memcpy(&av->av.grh.word_1, &gid, sizeof(gid)); + } + /* for the time beeing we use a hard coded PMTU of 2048 Bytes */ + av->av.pmtu = 4; /* TODO */ + + /* dgid comes in grh.word_3 */ + memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + EHCA_REGISTER_AV(device, pd); + + EDEB_EX(7,"pd=%p ah_attr=%p av=%p", pd, ah_attr, av); + return (&av->ib_ah); + + create_ah_exit1: + ehca_av_delete(av); + + create_ah_exit0: + EDEB_EX(7,"retcode=%x pd=%p ah_attr=%p", retcode, pd, ah_attr); + return ERR_PTR(retcode); +} + +int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + struct ehca_av *av = NULL; + struct ehca_ud_av new_ehca_av; + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7,"ah=%p ah_attr=%p", ah, ah_attr); + + memset(&new_ehca_av, 0, sizeof(new_ehca_av)); + new_ehca_av.sl = ah_attr->sl; + new_ehca_av.dlid = ntohs(ah_attr->dlid); + new_ehca_av.slid_path_bits = ah_attr->src_path_bits; + new_ehca_av.ipd = ah_attr->static_rate; + new_ehca_av.lnh = EHCA_BMASK_SET(GRH_FLAG_MASK, + ((ah_attr->ah_flags & IB_AH_GRH) > 0)); + new_ehca_av.grh.word_0 = EHCA_BMASK_SET(GRH_TCLASS_MASK, + ah_attr->grh.traffic_class); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_FLOWLABEL_MASK, + ah_attr->grh.flow_label); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_HOPLIMIT_MASK, + ah_attr->grh.hop_limit); + new_ehca_av.grh.word_0 |= EHCA_BMASK_SET(GRH_NEXTHEADER_MASK, 0x1b); + new_ehca_av.grh.word_0 = be64_to_cpu(new_ehca_av.grh.word_0); + + /* set sgid in grh.word_1 */ + if (ah_attr->ah_flags & IB_AH_GRH) { + int rc = 0; + struct ib_port_attr port_attr; + union ib_gid gid; + memset(&port_attr, 0, sizeof(port_attr)); + rc = ehca_query_port(ah->device, ah_attr->port_num, + &port_attr); + if (rc != 0) { /* invalid port number */ + ret = -EINVAL; + EDEB_ERR(4, "Invalid port number " + "ehca_query_port() returned %x " + "ah=%p 
ah_attr=%p port_num=%x", + rc, ah, ah_attr, ah_attr->port_num); + goto modify_ah_exit1; + } + memset(&gid, 0, sizeof(gid)); + rc = ehca_query_gid(ah->device, + ah_attr->port_num, + ah_attr->grh.sgid_index, &gid); + if (rc != 0) { + ret = -EINVAL; + EDEB_ERR(4, + "Failed to retrieve sgid " + "ehca_query_gid() returned %x " + "ah=%p ah_attr=%p port_num=%x " + "sgid_index=%x", + rc, ah, ah_attr, ah_attr->port_num, + ah_attr->grh.sgid_index); + goto modify_ah_exit1; + } + memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid)); + } + + new_ehca_av.pmtu = 4; /* TODO: see comment in create_ah() */ + + memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid, + sizeof(ah_attr->grh.dgid)); + + av = container_of(ah, struct ehca_av, ib_ah); + av->av = new_ehca_av; + + modify_ah_exit1: + EDEB_EX(7,"ret=%x ah=%p ah_attr=%p", ret, ah, ah_attr); + + return ret; +} + +int ehca_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + int ret = 0; + struct ehca_av *av = NULL; + + EHCA_CHECK_AV(ah); + EHCA_CHECK_ADR(ah_attr); + + EDEB_EN(7,"ah=%p ah_attr=%p", ah, ah_attr); + + av = container_of(ah, struct ehca_av, ib_ah); + memcpy(&ah_attr->grh.dgid, &av->av.grh.word_3, + sizeof(ah_attr->grh.dgid)); + ah_attr->sl = av->av.sl; + + ah_attr->dlid = av->av.dlid; + + ah_attr->src_path_bits = av->av.slid_path_bits; + ah_attr->static_rate = av->av.ipd; + ah_attr->ah_flags = EHCA_BMASK_GET(GRH_FLAG_MASK, av->av.lnh); + ah_attr->grh.traffic_class = EHCA_BMASK_GET(GRH_TCLASS_MASK, + av->av.grh.word_0); + ah_attr->grh.hop_limit = EHCA_BMASK_GET(GRH_HOPLIMIT_MASK, + av->av.grh.word_0); + ah_attr->grh.flow_label = EHCA_BMASK_GET(GRH_FLOWLABEL_MASK, + av->av.grh.word_0); + + EDEB_EX(7,"ah=%p ah_attr=%p ret=%x", ah, ah_attr, ret); + return ret; +} + +int ehca_destroy_ah(struct ib_ah *ah) +{ + int ret = 0; + + EHCA_CHECK_AV(ah); + EHCA_DEREGISTER_AV(ah); + + EDEB_EN(7,"ah=%p", ah); + + ehca_av_delete(container_of(ah, struct ehca_av, ib_ah)); + + EDEB_EX(7,"ret=%x ah=%p", ret, ah); + return ret; +} diff 
--git a/drivers/infiniband/hw/ehca/ehca_mcast.c b/drivers/infiniband/hw/ehca/ehca_mcast.c new file mode 100644 index 0000000..b49bcf6 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mcast.c @@ -0,0 +1,194 @@ + +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * mcast functions + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * Heiko J Schick + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_mcast.c,v 1.20 2006/02/06 10:17:34 schickhj Exp $ + */ + +#define DEB_PREFIX "mcas" + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "ehca_qes.h" +#include +#include +#include "ehca_iverbs.h" + +#define MAX_MC_LID 0xFFFE +#define MIN_MC_LID 0xC000 /* Multicast limits */ +#define EHCA_VALID_MULTICAST_GID(gid) ((gid)[0] == 0xFF) +#define EHCA_VALID_MULTICAST_LID(lid) (((lid) >= MIN_MC_LID) && ((lid) <= MAX_MC_LID)) + +int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 hipz_rc = H_Success; + int retcode = 0; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, retcode=%x", + ibqp->qp_type, EINVAL); + return (-EINVAL); + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid multicast gid retcode=%x", + EINVAL); + return (-EINVAL); + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid multicast lid retcode=%x", + lid, EINVAL); + return (-EINVAL); + } + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + hipz_rc = hipz_h_attach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->ehca_qp_core.galpas.kernel, + lid, my_gid); + if (H_Success != hipz_rc) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " + "hipz_rc=%lx", my_qp, ibqp->qp_num, hipz_rc); + } + retcode = ehca2ib_return_code(hipz_rc); + + EDEB_EX(7, "mcast attach retcode=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + retcode, my_qp, ibqp->qp_num, lid, + my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2],
my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], + my_gid.raw[14], my_gid.raw[15]); + + return retcode; +} + +int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct ehca_qp *my_qp = NULL; + struct ehca_shca *shca = NULL; + union ib_gid my_gid; + u64 hipz_rc = H_Success; + int retcode = 0; + + EHCA_CHECK_ADR(ibqp); + EHCA_CHECK_ADR(gid); + + my_qp = container_of(ibqp, struct ehca_qp, ib_qp); + + EHCA_CHECK_QP(my_qp); + if (ibqp->qp_type != IB_QPT_UD) { + EDEB_ERR(4, "invalid qp_type %x gid, retcode=%x", + ibqp->qp_type, EINVAL); + return (-EINVAL); + } + + shca = container_of(ibqp->pd->device, struct ehca_shca, ib_device); + EHCA_CHECK_ADR(shca); + + if (!(EHCA_VALID_MULTICAST_GID(gid->raw))) { + EDEB_ERR(4, "gid is not valid multicast gid retcode=%x", + EINVAL); + return (-EINVAL); + } else if ((lid < MIN_MC_LID) || (lid > MAX_MC_LID)) { + EDEB_ERR(4, "lid=%x is not valid multicast lid retcode=%x", + lid, EINVAL); + return (-EINVAL); + } + + EDEB_EN(7, "dgid=%p qp_num=%x lid=%x", + gid, ibqp->qp_num, lid); + + memcpy(&my_gid.raw, gid->raw, sizeof(union ib_gid)); + + hipz_rc = hipz_h_detach_mcqp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + my_qp->ehca_qp_core.galpas.kernel, + lid, my_gid); + if (H_Success != hipz_rc) { + EDEB_ERR(4, + "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " + "hipz_rc=%lx", my_qp, ibqp->qp_num, hipz_rc); + } + retcode = ehca2ib_return_code(hipz_rc); + + EDEB_EX(7, "mcast detach retcode=%x\n" + "ehca_qp=%p qp_num=%x lid=%x\n" + "my_gid= %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n" + " %x %x %x %x\n", + retcode, my_qp, ibqp->qp_num, lid, + my_gid.raw[0], my_gid.raw[1], + my_gid.raw[2], my_gid.raw[3], + my_gid.raw[4], my_gid.raw[5], + my_gid.raw[6], my_gid.raw[7], + my_gid.raw[8], my_gid.raw[9], + my_gid.raw[10], my_gid.raw[11], + my_gid.raw[12], my_gid.raw[13], +
my_gid.raw[14], my_gid.raw[15]); + + return retcode; +} diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c new file mode 100644 index 0000000..e110320 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_pd.c @@ -0,0 +1,100 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * PD functions + * + * Authors: Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_pd.c,v 1.25 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#define DEB_PREFIX "vpd " + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_iverbs.h" + +struct ib_pd *ehca_alloc_pd(struct ib_device *device, + struct ib_ucontext *context, struct ib_udata *udata) +{ + struct ib_pd *mypd = NULL; + struct ehca_pd *pd = NULL; + + EDEB_EN(7, "device=%p context=%p udata=%p", device, context, udata); + + EHCA_CHECK_DEVICE_P(device); + + pd = ehca_pd_new(); + if (!pd) { + EDEB_ERR(4, "ERROR device=%p context=%p pd=%p " + "out of memory", device, context, mypd); + return ERR_PTR(-ENOMEM); + } + + /* kernel pd when (device,-1,0) + * user pd only if context != -1 */ + if (context == NULL) { + /* kernel pds after init reuses always + * the one created in ehca_shca_reopen() + */ + struct ehca_shca *shca = container_of(device, struct ehca_shca, + ib_device); + pd->fw_pd.value = shca->pd->fw_pd.value; + } else { + pd->fw_pd.value = (u64)pd; + } + + mypd = &pd->ib_pd; + + EHCA_REGISTER_PD(device, pd); + + EDEB_EX(7, "device=%p context=%p pd=%p", device, context, mypd); + + return (mypd); +} + +int ehca_dealloc_pd(struct ib_pd *pd) +{ + int ret = 0; + EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD(pd); + EHCA_DEREGISTER_PD(pd); + ehca_pd_delete(container_of(pd, struct ehca_pd, ib_pd)); + + EDEB_EX(7, "pd=%p", pd); + return ret; +} From rolandd at cisco.com Fri Feb 17 16:57:57 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:57 -0800 Subject: [openib-general] [PATCH 20/22] ehca userspace verbs In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005757.13620.13628.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_uverbs.c | 376 ++++++++++++++++++++++++++++++ 1 files changed, 376 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c 
b/drivers/infiniband/hw/ehca/ehca_uverbs.c new file mode 100644 index 0000000..f813e9c --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -0,0 +1,376 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * userspace support verbs + * + * Authors: Heiko J Schick + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_uverbs.c,v 1.29 2006/02/06 10:17:34 schickhj Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "uver" + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_eq.h" +#include "ehca_mrmw.h" + +#include "hcp_sense.h" /* TODO: later via hipz_* header file */ +#include "hcp_if.h" /* TODO: later via hipz_* header file */ + +struct ib_ucontext *ehca_alloc_ucontext(struct ib_device *device, + struct ib_udata *udata) +{ + struct ehca_ucontext *my_context = NULL; + EHCA_CHECK_ADR_P(device); + EDEB_EN(7, "device=%p name=%s", device, device->name); + my_context = kmalloc(sizeof *my_context, GFP_KERNEL); + if (NULL == my_context) { + EDEB_ERR(4, "Out of memory device=%p", device); + return ERR_PTR(-ENOMEM); + } + memset(my_context, 0, sizeof(*my_context)); + EDEB_EX(7, "device=%p ucontext=%p", device, my_context); + return &my_context->ib_ucontext; +} + +int ehca_dealloc_ucontext(struct ib_ucontext *context) +{ + struct ehca_ucontext *my_context = NULL; + EHCA_CHECK_ADR(context); + EDEB_EN(7, "ucontext=%p", context); + my_context = container_of(context, struct ehca_ucontext, ib_ucontext); + kfree(my_context); + EDEB_EN(7, "ucontext=%p", context); + return 0; +} + +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct page *mypage = 0; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... 
*/ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + + EDEB_EN(7, + "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset); + + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq; + + down_read(&ehca_cq_idr_sem); + cq = idr_find(&ehca_cq_idr, idr_handle); + up_read(&ehca_cq_idr_sem); + + /* make sure this mmap really belongs to the authorized user */ + if (cq == 0) { + EDEB_ERR(4, "cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { + void *vaddr; + EDEB(6, "cq=%p cq queuearea", cq); + vaddr = address - vma->vm_start + + cq->ehca_cq_core.ipz_queue.queue; + EDEB(6, "queue=%p vaddr=%p", + cq->ehca_cq_core.ipz_queue.queue, vaddr); + mypage = vmalloc_to_page(vaddr); + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp; + + down_read(&ehca_qp_idr_sem); + qp = idr_find(&ehca_qp_idr, idr_handle); + up_read(&ehca_qp_idr_sem); + + /* make sure this mmap really belongs to the authorized user */ + if (qp == NULL) { + EDEB_ERR(4, "qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + if (rsrc_type == 2) { /* rqueue */ + void *vaddr; + EDEB(6, "qp=%p qp rqueuearea", qp); + vaddr = address - vma->vm_start + + qp->ehca_qp_core.ipz_rqueue.queue; + EDEB(6, "rqueue=%p vaddr=%p", + qp->ehca_qp_core.ipz_rqueue.queue, vaddr); + mypage = vmalloc_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + void *vaddr; + EDEB(6, "qp=%p qp squeuearea", qp); + vaddr = address - vma->vm_start + + qp->ehca_qp_core.ipz_squeue.queue; + EDEB(6, "squeue=%p vaddr=%p", + qp->ehca_qp_core.ipz_squeue.queue, vaddr); + mypage = vmalloc_to_page(vaddr); + } + } + if (mypage == 0) { + EDEB_ERR(4, "Invalid page adr==NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } + get_page(mypage); + EDEB_EX(7, "page adr=%p", mypage); + return mypage; +} + +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; + +/* TODO: better error output messages !!! 
+ NO RETURN WITHOUT ERROR + */ +int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + + + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 ret = -EFAULT; /* assume the worst */ + u64 vsize = 0; /* must be calculated/set below */ + u64 physical = 0; /* must be calculated/set below */ + + EDEB_EN(7, "vm_start=%lx vm_end=%lx vm_page_prot=%lx vm_fileoff=%lx", + vma->vm_start, vma->vm_end, vma->vm_page_prot, fileoffset); + + if (q_type == 1) { /* CQ */ + struct ehca_cq *cq; + + down_read(&ehca_cq_idr_sem); + cq = idr_find(&ehca_cq_idr, idr_handle); + up_read(&ehca_cq_idr_sem); + + /* make sure this mmap really belongs to the authorized user */ + if (cq == 0) + return -EINVAL; + if (cq->ib_cq.uobject == 0) + return -EINVAL; + if (cq->ib_cq.uobject->context != context) + return -EINVAL; + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = cq->ehca_cq_core.galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* cq queue_addr */ + EDEB(6, "cq=%p cq q_addr", cq); + /* vma->vm_page_prot = + * pgprot_noncached(vma->vm_page_prot); */ + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(6, "bad resource type %x", 
rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else if (q_type == 2) { /* QP */ + struct ehca_qp *qp; + + down_read(&ehca_qp_idr_sem); + qp = idr_find(&ehca_qp_idr, idr_handle); + up_read(&ehca_qp_idr_sem); + + /* make sure this mmap really belongs to the authorized user */ + if (qp == NULL || qp->ib_qp.uobject == NULL || + qp->ib_qp.uobject->context != context) { + EDEB(6, "qp=%p, uobject=%p, context=%p", + qp, qp->ib_qp.uobject, qp->ib_qp.uobject->context); + ret = -EINVAL; + goto mmap_exit0; + } + if (rsrc_type == 1) { /* galpa fw handle */ + EDEB(6, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + ret = -EINVAL; + goto mmap_exit0; + } + + physical = qp->ehca_qp_core.galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + goto mmap_exit0; + } else if (rsrc_type == 2) { /* qp rqueue_addr */ + EDEB(6, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else if (rsrc_type == 3) { /* qp squeue_addr */ + EDEB(6, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + ret = 0; + goto mmap_exit0; + } else { + EDEB_ERR(4, "bad resource type %x", + rsrc_type); + ret = -EINVAL; + goto mmap_exit0; + } + } else { + EDEB_ERR(4, "bad queue type %x", q_type); + ret = -EINVAL; + goto mmap_exit0; + } + + mmap_exit0: + EDEB_EX(7, "ret=%x", ret); + return ret; +} + +int ehca_mmap_nopage(u64 foffset,u64 length,void ** mapped,struct vm_area_struct ** vma) +{ + down_write(&current->mm->mmap_sem); +
*mapped=(void*) + do_mmap(NULL,0, + length, + PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, + foffset); + up_write(&current->mm->mmap_sem); + if (*mapped) { + *vma = find_vma(current->mm,(u64)*mapped); + if (*vma) { + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + } else { + EDEB_ERR(4,"couldn't find queue vma queue=%p", + *mapped); + } + } else { + EDEB_ERR(4,"couldn't create mmap length=%lx",length); + } + EDEB(7,"mapped=%p",*mapped); + return 0; +} + +int ehca_mmap_register(u64 physical,void ** mapped,struct vm_area_struct ** vma) +{ + int ret; + unsigned long vsize; + ehca_mmap_nopage(0,4096,mapped,vma); + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != 4096) { + EDEB_ERR(4, "invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + ret = -EINVAL; + return ret; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + EDEB(6, "vsize=%lx physical=%lx", vsize, + physical); + ret = + remap_pfn_range((*vma), (*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret != 0) { + EDEB_ERR(4, + "Error: remap_pfn_range() returned %x!", + ret); + ret = -ENOMEM; + } + return ret; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret=0; + struct mm_struct *mm = current->mm; + if (mm!=0) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} + +/* eof ehca_uverbs.c */ From rolandd at cisco.com Fri Feb 17 16:57:54 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:54 -0800 Subject: [openib-general] [PATCH 19/22] ehca memory regions In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005754.13620.41418.stgit@localhost.localdomain> From: Roland Dreier Nearly all the inline functions in ehca_mrmw.h look too big to be inlined. 
Why can't they just be static functions in ehca_mrmw.c? --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 1711 ++++++++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_mrmw.h | 739 ++++++++++++++ 2 files changed, 2450 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c new file mode 100644 index 0000000..d756082 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -0,0 +1,1711 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW functions + * + * Authors: Dietmar Decker + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_mrmw.c,v 1.86 2006/02/07 07:51:13 decker Exp $ + */ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +#include "ehca_kernel.h" +#include "ehca_iverbs.h" +#include "hcp_if.h" +#include "ehca_mrmw.h" + +extern int ehca_use_hp_mr; + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *ib_mr; + int retcode = 0; + struct ehca_mr *e_maxmr = 0; + struct ehca_pd *e_pd; + struct ehca_shca *shca; + + EDEB_EN(7, "pd=%p mr_access_flags=%x", pd, mr_access_flags); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + if (shca->maxmr) { + e_maxmr = ehca_mr_new(); + if (!e_maxmr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto get_dma_mr_exit0; + } + + retcode = ehca_reg_maxmr(shca, e_maxmr, + (u64 *)KERNELBASE, + mr_access_flags, e_pd, + &e_maxmr->ib.ib_mr.lkey, + &e_maxmr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto get_dma_mr_exit0; + } + ib_mr = &e_maxmr->ib.ib_mr; + } else { + EDEB_ERR(4, "no internal max-MR exist!"); + ib_mr = ERR_PTR(-EINVAL); + goto get_dma_mr_exit0; + } + + get_dma_mr_exit0: + if (IS_ERR(ib_mr) == 0) + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + else 
+ EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x ", + PTR_ERR(ib_mr), pd, mr_access_flags); + return (ib_mr); +} /* end ehca_get_dma_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *ib_mr = 0; + int retcode = 0; + struct ehca_mr *e_mr = 0; + struct ehca_shca *shca = 0; + struct ehca_pd *e_pd = 0; + u64 size = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + u32 num_pages_mr = 0; + + EDEB_EN(7, "pd=%p phys_buf_array=%p num_phys_buf=%x " + "mr_access_flags=%x iova_start=%p", pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + + EHCA_CHECK_PD_P(pd); + if ((num_phys_buf <= 0) || ehca_adr_bad(phys_buf_array)) { + EDEB_ERR(4, "bad input values: num_phys_buf=%x " + "phys_buf_array=%p", num_phys_buf, phys_buf_array); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + /* check physical buffer list and calculate size */ + retcode = ehca_mr_chk_buf_and_calc_size(phys_buf_array, num_phys_buf, + iova_start, &size); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit0; + } + if ((size == 0) || + ((0xFFFFFFFFFFFFFFFF - size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: size=%lx iova_start=%p", + size, iova_start); + ib_mr = ERR_PTR(-EINVAL); + goto reg_phys_mr_exit0; + } + + e_pd = 
container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_phys_mr_exit0; + } + + /* determine number of MR pages */ + /* pagesize currently hardcoded to 4k ... TODO.. */ + num_pages_mr = + ((((u64)iova_start % PAGE_SIZE) + size + + PAGE_SIZE - 1) / PAGE_SIZE); + + /* register MR on HCA */ + if (ehca_mr_is_maxmr(size, iova_start)) { + e_mr->flags |= EHCA_MR_FLAG_MAXMR; + retcode = ehca_reg_maxmr(shca, e_mr, iova_start, + mr_access_flags, e_pd, + &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit1; + } + } else { + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size, + mr_access_flags, e_pd, &pginfo, + &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_phys_mr_exit1; + } + } + + /* successful registration of all pages */ + ib_mr = &e_mr->ib.ib_mr; + goto reg_phys_mr_exit0; + + reg_phys_mr_exit1: + ehca_mr_delete(e_mr); + reg_phys_mr_exit0: + if (IS_ERR(ib_mr) == 0) + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + else + EDEB_EX(4, "rc=%lx pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + PTR_ERR(ib_mr), pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + return (ib_mr); +} /* end ehca_reg_phys_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, + struct ib_udata *udata) +{ + struct ib_mr *ib_mr = 0; + struct ehca_mr *e_mr = 0; + struct ehca_shca *shca = 0; 
+ struct ehca_pd *e_pd = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + int retcode = 0; + u32 num_pages_mr = 0; + + EDEB_EN(7, "pd=%p region=%p mr_access_flags=%x udata=%p", + pd, region, mr_access_flags, udata); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(region)) { + EDEB_ERR(4, "bad input values: region=%p", region); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + EDEB(7, "user_base=%lx virt_base=%lx length=%lx offset=%x page_size=%x " + "chunk_list.next=%p", + region->user_base, region->virt_base, region->length, + region->offset, region->page_size, region->chunk_list.next); + if (region->page_size != PAGE_SIZE) { + /* @TODO large page support */ + EDEB_ERR(4, "large pages not supported, region->page_size=%x", + region->page_size); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + if ((region->length == 0) || + ((0xFFFFFFFFFFFFFFFF - region->length) < region->virt_base)) { + EDEB_ERR(4, "bad input values: length=%lx virt_base=%lx", + region->length, region->virt_base); + ib_mr = ERR_PTR(-EINVAL); + goto reg_user_mr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + ib_mr = ERR_PTR(-ENOMEM); + goto reg_user_mr_exit0; + } + + /* determine number of MR pages */ + /* pagesize currently hardcoded to 4k ...TODO... 
*/ + num_pages_mr = + (((region->virt_base % PAGE_SIZE) + region->length + + PAGE_SIZE - 1) / PAGE_SIZE); + + /* register MR on HCA */ + pginfo.type = EHCA_MR_PGI_USER; + pginfo.num_pages = num_pages_mr; + pginfo.region = region; + pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, + (&region->chunk_list), + list); + + retcode = ehca_reg_mr(shca, e_mr, (u64 *)region->virt_base, + region->length, mr_access_flags, e_pd, &pginfo, + &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + ib_mr = ERR_PTR(retcode); + goto reg_user_mr_exit1; + } + + /* successful registration of all pages */ + ib_mr = &e_mr->ib.ib_mr; + goto reg_user_mr_exit0; + + reg_user_mr_exit1: + ehca_mr_delete(e_mr); + reg_user_mr_exit0: + if (IS_ERR(ib_mr) == 0) + EDEB_EX(7, "ib_mr=%p lkey=%x rkey=%x", + ib_mr, ib_mr->lkey, ib_mr->rkey); + else + EDEB_EX(4, "rc=%lx pd=%p region=%p mr_access_flags=%x " + "udata=%p", + PTR_ERR(ib_mr), pd, region, mr_access_flags, udata); + return (ib_mr); +} /* end ehca_reg_user_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + int retcode = 0; + struct ehca_shca *shca = 0; + struct ehca_mr *e_mr = 0; + u64 new_size = 0; + u64 *new_start = 0; + u32 new_acl = 0; + struct ehca_pd *new_pd = 0; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + unsigned long sl_flags; + u64 num_pages_mr = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + + EDEB_EN(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!(mr_rereg_mask & IB_MR_REREG_TRANS)) { + /*@TODO not supported, because PHYP rereg hCall needs pages*/ + 
/*@TODO: We will follow this with Tom ....*/ + EDEB_ERR(4, "rereg without IB_MR_REREG_TRANS not supported yet," + " mr_rereg_mask=%x", mr_rereg_mask); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + EHCA_CHECK_MR(mr); + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (mr_rereg_mask & IB_MR_REREG_PD) { + EHCA_CHECK_PD(pd); + } + + if ((mr_rereg_mask & + ~(IB_MR_REREG_TRANS | IB_MR_REREG_PD | IB_MR_REREG_ACCESS)) || + (mr_rereg_mask == 0)) { + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + + /* check other parameters */ + if (e_mr == shca->maxmr) { + /* should be impossible, however reject to be sure */ + EDEB_ERR(3, "rereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (mr_rereg_mask & IB_MR_REREG_TRANS) { /* transl., i.e. addr/size */ + if (e_mr->flags & EHCA_MR_FLAG_FMR) { + EDEB_ERR(4, "not supported for FMR, mr=%p flags=%x", + mr, e_mr->flags); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + if (ehca_adr_bad(phys_buf_array) || num_phys_buf <= 0) { + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "phys_buf_array=%p num_phys_buf=%x", + mr_rereg_mask, phys_buf_array, num_phys_buf); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + } + if ((mr_rereg_mask & IB_MR_REREG_ACCESS) && /* change ACL */ + (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_rereg_mask=%x " + "mr_access_flags=%x", mr_rereg_mask, mr_access_flags); + retcode = -EINVAL; + goto rereg_phys_mr_exit0; + } + + /* set requested values dependent on rereg request */ + spin_lock_irqsave(&e_mr->mrlock, 
sl_flags); /* get lock @TODO for MR*/ + new_start = e_mr->start; /* new == old address */ + new_size = e_mr->size; /* new == old length */ + new_acl = e_mr->acl; /* new == old access control */ + new_pd = container_of(mr->pd,struct ehca_pd,ib_pd); /*new == old PD*/ + + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + new_start = iova_start; /* change address */ + /* check physical buffer list and calculate size */ + retcode = ehca_mr_chk_buf_and_calc_size(phys_buf_array, + num_phys_buf, + iova_start, &new_size); + if (retcode != 0) + goto rereg_phys_mr_exit1; + if ((new_size == 0) || + ((0xFFFFFFFFFFFFFFFF - new_size) < (u64)iova_start)) { + EDEB_ERR(4, "bad input values: new_size=%lx " + "iova_start=%p", new_size, iova_start); + retcode = -EINVAL; + goto rereg_phys_mr_exit1; + } + num_pages_mr = ((((u64)new_start % PAGE_SIZE) + + new_size + PAGE_SIZE - 1) / PAGE_SIZE); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_phys_buf = num_phys_buf; + pginfo.phys_buf_array = phys_buf_array; + } + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + new_acl = mr_access_flags; + if (mr_rereg_mask & IB_MR_REREG_PD) + new_pd = container_of(pd, struct ehca_pd, ib_pd); + + EDEB(7, "mr=%p new_start=%p new_size=%lx new_acl=%x new_pd=%p " + "num_pages_mr=%lx", + e_mr, new_start, new_size, new_acl, new_pd, num_pages_mr); + + retcode = ehca_rereg_mr(shca, e_mr, new_start, new_size, new_acl, + new_pd, &pginfo, &tmp_lkey, &tmp_rkey); + if (retcode != 0) + goto rereg_phys_mr_exit1; + + /* successful reregistration */ + if (mr_rereg_mask & IB_MR_REREG_PD) + mr->pd = pd; + mr->lkey = tmp_lkey; + mr->rkey = tmp_rkey; + + rereg_phys_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); /* free spin lock */ + rereg_phys_mr_exit0: + if (retcode == 0) + EDEB_EX(7, "mr=%p mr_rereg_mask=%x pd=%p phys_buf_array=%p " + "num_phys_buf=%x mr_access_flags=%x iova_start=%p", + mr, mr_rereg_mask, pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + else + 
EDEB_EX(4, "retcode=%x mr=%p mr_rereg_mask=%x pd=%p " + "phys_buf_array=%p num_phys_buf=%x mr_access_flags=%x " + "iova_start=%p", + retcode, mr, mr_rereg_mask, pd, phys_buf_array, + num_phys_buf, mr_access_flags, iova_start); + + return (retcode); +} /* end ehca_rereg_phys_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_shca *shca = 0; + struct ehca_mr *e_mr = 0; + struct ipz_pd fwpd; /* Firmware PD */ + u32 access_ctrl = 0; + u64 tmp_remote_size = 0; + u64 tmp_remote_len = 0; + + unsigned long sl_flags; + + EDEB_EN(7, "mr=%p mr_attr=%p", mr, mr_attr); + + EHCA_CHECK_MR(mr); + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + if (ehca_adr_bad(mr_attr)) { + EDEB_ERR(4, "bad input values: mr_attr=%p", mr_attr); + retcode = -EINVAL; + goto query_mr_exit0; + } + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + retcode = -EINVAL; + goto query_mr_exit0; + } + + shca = container_of(mr->device, struct ehca_shca, ib_device); + memset(mr_attr, 0, sizeof(struct ib_mr_attr)); + spin_lock_irqsave(&e_mr->mrlock, sl_flags); /* get spin lock @TODO?? 
*/ + + rc = hipz_h_query_mr(shca->ipz_hca_handle, &e_mr->pf, + &e_mr->ipz_mr_handle, &mr_attr->size, + &mr_attr->device_virt_addr, &tmp_remote_size, + &tmp_remote_len, &access_ctrl, &fwpd, + &mr_attr->lkey, &mr_attr->rkey); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_mr_query failed, rc=%lx mr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + rc, mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + retcode = ehca_mrmw_map_rc_query_mr(rc); + goto query_mr_exit1; + } + ehca_mrmw_reverse_map_acl(&access_ctrl, &mr_attr->mr_access_flags); + mr_attr->pd = mr->pd; + + query_mr_exit1: + spin_unlock_irqrestore(&e_mr->mrlock, sl_flags); /* free spin lock */ + query_mr_exit0: + if (retcode == 0) + EDEB_EX(7, "pd=%p device_virt_addr=%lx size=%lx " + "mr_access_flags=%x lkey=%x rkey=%x", + mr_attr->pd, mr_attr->device_virt_addr, + mr_attr->size, mr_attr->mr_access_flags, + mr_attr->lkey, mr_attr->rkey); + else + EDEB_EX(4, "retcode=%x mr=%p mr_attr=%p", retcode, mr, mr_attr); + return (retcode); +} /* end ehca_query_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dereg_mr(struct ib_mr *mr) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_shca *shca = 0; + struct ehca_mr *e_mr = 0; + + EDEB_EN(7, "mr=%p", mr); + + EHCA_CHECK_MR(mr); + e_mr = container_of(mr, struct ehca_mr, ib.ib_mr); + shca = container_of(mr->device, struct ehca_shca, ib_device); + + if ((e_mr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not supported for FMR, mr=%p e_mr=%p " + "e_mr->flags=%x", mr, e_mr, e_mr->flags); + retcode = -EINVAL; + goto dereg_mr_exit0; + } else if (e_mr == shca->maxmr) { + /* should be impossible, however reject to be sure */ + EDEB_ERR(3, "dereg internal max-MR impossible, mr=%p " + "shca->maxmr=%p mr->lkey=%x", + mr, shca->maxmr, mr->lkey); + retcode = -EINVAL; + goto dereg_mr_exit0; + } + + /*@TODO: BUSY: MR still has bound 
window(s) */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, &e_mr->pf, + &e_mr->ipz_mr_handle); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx shca=%p e_mr=%p" + " hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + rc, shca, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, mr->lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto dereg_mr_exit0; + } + + /* successful deregistration */ + ehca_mr_delete(e_mr); + + dereg_mr_exit0: + if (retcode == 0) + EDEB_EX(7, ""); + else + EDEB_EX(4, "retcode=%x mr=%p", retcode, mr); + return (retcode); +} /* end ehca_dereg_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *ib_mw = 0; + u64 rc = H_Success; + struct ehca_shca *shca = 0; + struct ehca_mw *e_mw = 0; + struct ehca_pd *e_pd = 0; + + EDEB_EN(7, "pd=%p", pd); + + EHCA_CHECK_PD_P(pd); + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_mw = ehca_mw_new(); + if (!e_mw) { + ib_mw = ERR_PTR(-ENOMEM); + goto alloc_mw_exit0; + } + + rc = hipz_h_alloc_resource_mw(shca->ipz_hca_handle, &e_mw->pf, + &shca->pf, e_pd->fw_pd, + &e_mw->ipz_mw_handle, &e_mw->ib_mw.rkey); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_mw_allocate failed, rc=%lx shca=%p " + "hca_hndl=%lx mw=%p", rc, shca, + shca->ipz_hca_handle.handle, e_mw); + ib_mw = ERR_PTR(ehca_mrmw_map_rc_alloc(rc)); + goto alloc_mw_exit1; + } + /* save R_Key in local copy */ + /*@TODO????? 
mw->rkey = *rkey_p; */ + + /* successful MW allocation */ + ib_mw = &e_mw->ib_mw; + goto alloc_mw_exit0; + + alloc_mw_exit1: + ehca_mw_delete(e_mw); + alloc_mw_exit0: + if (IS_ERR(ib_mw) == 0) + EDEB_EX(7, "ib_mw=%p rkey=%x", ib_mw, ib_mw->rkey); + else + EDEB_EX(4, "rc=%lx pd=%p", PTR_ERR(ib_mw), pd); + return (ib_mw); +} /* end ehca_alloc_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + int retcode = 0; + + /*@TODO: not supported up to now */ + EDEB_ERR(4, "bind MW currently not supported by HCAD"); + retcode = -EPERM; + goto bind_mw_exit0; + + bind_mw_exit0: + if (retcode == 0) + EDEB_EX(7, "qp=%p mw=%p mw_bind=%p", qp, mw, mw_bind); + else + EDEB_EX(4, "rc=%x qp=%p mw=%p mw_bind=%p", + retcode, qp, mw, mw_bind); + return (retcode); +} /* end ehca_bind_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_mw(struct ib_mw *mw) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_shca *shca = 0; + struct ehca_mw *e_mw = 0; + + EDEB_EN(7, "mw=%p", mw); + + EHCA_CHECK_MW(mw); + e_mw = container_of(mw, struct ehca_mw, ib_mw); + shca = container_of(mw->device, struct ehca_shca, ib_device); + + rc = hipz_h_free_resource_mw(shca->ipz_hca_handle, &e_mw->pf, + &e_mw->ipz_mw_handle); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_free_mw failed, rc=%lx shca=%p mw=%p " + "rkey=%x hca_hndl=%lx mw_hndl=%lx", + rc, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, + e_mw->ipz_mw_handle.handle); + retcode = ehca_mrmw_map_rc_free_mw(rc); + goto dealloc_mw_exit0; + } + /* successful deallocation */ + ehca_mw_delete(e_mw); + + dealloc_mw_exit0: + if (retcode == 0) + EDEB_EX(7, ""); + else + EDEB_EX(4, "retcode=%x mw=%p", retcode, mw); + return 
(retcode); +} /* end ehca_dealloc_mw() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *ib_fmr = 0; + struct ehca_shca *shca = 0; + struct ehca_mr *e_fmr = 0; + int retcode = 0; + struct ehca_pd *e_pd = 0; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + + EDEB_EN(7, "pd=%p mr_access_flags=%x fmr_attr=%p", + pd, mr_access_flags, fmr_attr); + + EHCA_CHECK_PD_P(pd); + if (ehca_adr_bad(fmr_attr)) { + EDEB_ERR(4, "bad input values: fmr_attr=%p", fmr_attr); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + EDEB(7, "max_pages=%x max_maps=%x page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, fmr_attr->page_shift); + + /* check other parameters */ + if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE)) || + ((mr_access_flags & IB_ACCESS_REMOTE_ATOMIC) && + !(mr_access_flags & IB_ACCESS_LOCAL_WRITE))) { + /* Remote Write Access requires Local Write Access */ + /* Remote Atomic Access requires Local Write Access */ + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if (mr_access_flags & IB_ACCESS_MW_BIND) { + EDEB_ERR(4, "bad input values: mr_access_flags=%x", + mr_access_flags); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if ((fmr_attr->max_pages == 0) || (fmr_attr->max_maps == 0)) { + EDEB_ERR(4, "bad input values: fmr_attr->max_pages=%x " + "fmr_attr->max_maps=%x fmr_attr->page_shift=%x", + fmr_attr->max_pages, fmr_attr->max_maps, + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + if ((1 << fmr_attr->page_shift) != PAGE_SIZE) { + /* pagesize currently hardcoded to 4k ... 
*/ + EDEB_ERR(4, "unsupported fmr_attr->page_shift=%x", + fmr_attr->page_shift); + ib_fmr = ERR_PTR(-EINVAL); + goto alloc_fmr_exit0; + } + + e_pd = container_of(pd, struct ehca_pd, ib_pd); + shca = container_of(pd->device, struct ehca_shca, ib_device); + + e_fmr = ehca_mr_new(); + if (e_fmr == 0) { + ib_fmr = ERR_PTR(-ENOMEM); + goto alloc_fmr_exit0; + } + e_fmr->flags |= EHCA_MR_FLAG_FMR; + + /* register MR on HCA */ + retcode = ehca_reg_mr(shca, e_fmr, 0, + fmr_attr->max_pages * PAGE_SIZE, + mr_access_flags, e_pd, &pginfo, + &tmp_lkey, &tmp_rkey); + if (retcode != 0) { + ib_fmr = ERR_PTR(retcode); + goto alloc_fmr_exit1; + } + + /* successful registration of all pages */ + e_fmr->fmr_page_size = 1 << fmr_attr->page_shift; + e_fmr->fmr_max_pages = fmr_attr->max_pages; /* pagesize hardcoded 4k */ + e_fmr->fmr_max_maps = fmr_attr->max_maps; + e_fmr->fmr_map_cnt = 0; + ib_fmr = &e_fmr->ib.ib_fmr; + goto alloc_fmr_exit0; + + alloc_fmr_exit1: + ehca_mr_delete(e_fmr); + alloc_fmr_exit0: + if (IS_ERR(ib_fmr) == 0) + EDEB_EX(7, "ib_fmr=%p tmp_lkey=%x tmp_rkey=%x", + ib_fmr, tmp_lkey, tmp_rkey); + else + EDEB_EX(4, "rc=%lx pd=%p mr_access_flags=%x " + "fmr_attr=%p", PTR_ERR(ib_fmr), pd, + mr_access_flags, fmr_attr); + return (ib_fmr); +} /* end ehca_alloc_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, + int list_len, + u64 iova) +{ + int retcode = 0; + struct ehca_shca *shca = 0; + struct ehca_mr *e_fmr = 0; + struct ehca_pd *e_pd = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + /*@TODO unsigned long sl_flags; */ + + EDEB_EN(7, "fmr=%p page_list=%p list_len=%x iova=%lx", + fmr, page_list, list_len, iova); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(fmr->device, struct ehca_shca, 
ib_device); + e_pd = container_of(fmr->pd, struct ehca_pd, ib_pd); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto map_phys_fmr_exit0; + } + retcode = ehca_fmr_check_page_list(e_fmr, page_list, list_len); + if (retcode != 0) + goto map_phys_fmr_exit0; + if (iova % PAGE_SIZE) { + /* only whole-numbered pages */ + EDEB_ERR(4, "bad iova, iova=%lx", iova); + retcode = -EINVAL; + goto map_phys_fmr_exit0; + } + if (e_fmr->fmr_map_cnt >= e_fmr->fmr_max_maps) { + /* HCAD does not limit the maps, however trace this anyway */ + EDEB(6, "map limit exceeded, fmr=%p e_fmr->fmr_map_cnt=%x " + "e_fmr->fmr_max_maps=%x", + fmr, e_fmr->fmr_map_cnt, e_fmr->fmr_max_maps); + } + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = list_len; + pginfo.page_list = page_list; + + /* @TODO spin_lock_irqsave(&e_fmr->mrlock, sl_flags); */ + + retcode = ehca_rereg_mr(shca, e_fmr, (u64 *)iova, + list_len * PAGE_SIZE, + e_fmr->acl, e_pd, &pginfo, + &tmp_lkey, &tmp_rkey); + if (retcode != 0) { + /* @TODO spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + goto map_phys_fmr_exit0; + } + /* successful reregistration */ + e_fmr->fmr_map_cnt++; + /* @TODO spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + + e_fmr->ib.ib_fmr.lkey = tmp_lkey; + e_fmr->ib.ib_fmr.rkey = tmp_rkey; + + map_phys_fmr_exit0: + if (retcode == 0) + EDEB_EX(7, "lkey=%x rkey=%x", + e_fmr->ib.ib_fmr.lkey, e_fmr->ib.ib_fmr.rkey); + else + EDEB_EX(4, "retcode=%x fmr=%p page_list=%p list_len=%x " + "iova=%lx", + retcode, fmr, page_list, list_len, iova); + return (retcode); +} /* end ehca_map_phys_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_unmap_fmr(struct list_head *fmr_list) +{ + int retcode = 0; + struct ib_fmr *ib_fmr; + struct ehca_shca *shca = 0; + struct ehca_shca *prev_shca = 0; + struct 
ehca_mr *e_fmr = 0; + u32 num_fmr = 0; + u32 unmap_fmr_cnt = 0; + /* @TODO unsigned long sl_flags; */ + + EDEB_EN(7, "fmr_list=%p", fmr_list); + + /* check all FMR belong to same SHCA, and check internal flag */ + list_for_each_entry(ib_fmr, fmr_list, list) { + prev_shca = shca; + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + EHCA_CHECK_FMR(ib_fmr); + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + if ((shca != prev_shca) && (prev_shca != 0)) { + EDEB_ERR(4, "SHCA mismatch, shca=%p prev_shca=%p " + "e_fmr=%p", shca, prev_shca, e_fmr); + retcode = -EINVAL; + goto unmap_fmr_exit0; + } + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto unmap_fmr_exit0; + } + num_fmr++; + } + + /* loop over all FMRs to unmap */ + list_for_each_entry(ib_fmr, fmr_list, list) { + unmap_fmr_cnt++; + e_fmr = container_of(ib_fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(ib_fmr->device, struct ehca_shca, + ib_device); + /*@TODO??? spin_lock_irqsave(&fmr->mrlock, sl_flags); */ + retcode = ehca_unmap_one_fmr(shca, e_fmr); + /*@TODO???? 
spin_unlock_irqrestore(&fmr->mrlock, sl_flags); */ + if (retcode != 0) { + /* unmap failed, stop unmapping of rest of FMRs */ + EDEB_ERR(4, "unmap of one FMR failed, stop rest, " + "e_fmr=%p num_fmr=%x unmap_fmr_cnt=%x lkey=%x", + e_fmr, num_fmr, unmap_fmr_cnt, + e_fmr->ib.ib_fmr.lkey); + goto unmap_fmr_exit0; + } + } + + unmap_fmr_exit0: + if (retcode == 0) + EDEB_EX(7, "num_fmr=%x", num_fmr); + else + EDEB_EX(4, "retcode=%x fmr_list=%p num_fmr=%x unmap_fmr_cnt=%x", + retcode, fmr_list, num_fmr, unmap_fmr_cnt); + return (retcode); +} /* end ehca_unmap_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dealloc_fmr(struct ib_fmr *fmr) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_shca *shca = 0; + struct ehca_mr *e_fmr = 0; + + EDEB_EN(7, "fmr=%p", fmr); + + EHCA_CHECK_FMR(fmr); + e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); + shca = container_of(fmr->device, struct ehca_shca, ib_device); + + if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { + EDEB_ERR(4, "not a FMR, e_fmr=%p e_fmr->flags=%x", + e_fmr, e_fmr->flags); + retcode = -EINVAL; + goto free_fmr_exit0; + } + + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, &e_fmr->pf, + &e_fmr->ipz_mr_handle); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx e_fmr=%p " + "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", + rc, e_fmr, shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, fmr->lkey); + ehca_mrmw_map_rc_free_mr(rc); + goto free_fmr_exit0; + } + /* successful deregistration */ + ehca_mr_delete(e_fmr); + + free_fmr_exit0: + if (retcode == 0) + EDEB_EX(7, ""); + else + EDEB_EX(4, "retcode=%x fmr=%p", retcode, fmr); + return (retcode); +} /* end ehca_dealloc_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr(struct ehca_shca *shca, 
+ struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_mr->pf; + u32 hipz_acl = 0; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x e_pd=%p " + "pginfo=%p num_pages=%lx", shca, e_mr, iova_start, size, acl, + e_pd, pginfo, pginfo->num_pages); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + if (ehca_use_hp_mr == 1) + hipz_acl |= 0x00000001; + + rc = hipz_h_alloc_resource_mr(shca->ipz_hca_handle, pfmr, &shca->pf, + (u64)iova_start, size, hipz_acl, + e_pd->fw_pd, &e_mr->ipz_mr_handle, + lkey, rkey); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_alloc_mr failed, rc=%lx hca_hndl=%lx " + "mr_hndl=%lx", rc, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle); + retcode = ehca_mrmw_map_rc_alloc(rc); + goto ehca_reg_mr_exit0; + } + + retcode = ehca_reg_mr_rpages(shca, e_mr, pginfo); + if (retcode != 0) + goto ehca_reg_mr_exit1; + + /* successful registration */ + e_mr->num_pages = pginfo->num_pages; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + goto ehca_reg_mr_exit0; + + ehca_reg_mr_exit1: + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_mr->ipz_mr_handle); + if (rc != H_Success) { + EDEB(1, "rc=%lx shca=%p e_mr=%p iova_start=%p " + "size=%lx acl=%x e_pd=%p lkey=%x pginfo=%p num_pages=%lx", + rc, shca, e_mr, iova_start, size, acl, + e_pd, *lkey, pginfo, pginfo->num_pages); + ehca_catastrophic("internal error in ehca_reg_mr, " + "not recoverable"); + } + ehca_reg_mr_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", retcode, *lkey, *rkey); + else + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p iova_start=%p " + "size=%lx acl=%x e_pd=%p pginfo=%p num_pages=%lx", + retcode, shca, e_mr, iova_start, + size, acl, e_pd, pginfo, pginfo->num_pages); + return (retcode); +} /* end ehca_reg_mr() */ + 
+/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_mr_rpages(struct ehca_shca *shca, + struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_mr->pf; + u32 rnum = 0; + u64 rpage = 0; + u32 i; + u64 *kpage = 0; + + EDEB_EN(7, "shca=%p e_mr=%p pginfo=%p num_pages=%lx", + shca, e_mr, pginfo, pginfo->num_pages); + + kpage = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (kpage == 0) { + EDEB_ERR(4, "kpage alloc failed"); + retcode = -ENOMEM; + goto ehca_reg_mr_rpages_exit0; + } + memset(kpage, 0, PAGE_SIZE); + + /* max 512 pages per shot */ + for (i = 0; i < ((pginfo->num_pages + 512 - 1) / 512); i++) { + + if (i == ((pginfo->num_pages + 512 - 1) / 512) - 1) { + rnum = pginfo->num_pages % 512; /* last shot */ + if (rnum == 0) + rnum = 512; /* last shot is full */ + } else + rnum = 512; + + if (rnum > 1) { + retcode = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage); + if (retcode) { + EDEB_ERR(4, "ehca_set_pagebuf bad rc, " + "retcode=%x rnum=%x kpage=%p", + retcode, rnum, kpage); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + rpage = ehca_kv_to_g(kpage); + if (rpage == 0) { + EDEB_ERR(4, "kpage=%p i=%x", kpage, i); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } else { /* rnum==1 */ + retcode = ehca_set_pagebuf_1(e_mr, pginfo, &rpage); + if (retcode) { + EDEB_ERR(4, "ehca_set_pagebuf_1 bad rc, " + "retcode=%x i=%x", retcode, i); + retcode = -EFAULT; + goto ehca_reg_mr_rpages_exit1; + } + } + + EDEB(9, "i=%x rnum=%x rpage=%lx", i, rnum, rpage); + + rc = hipz_h_register_rpage_mr(shca->ipz_hca_handle, + &e_mr->ipz_mr_handle, pfmr, + &shca->pf, + 0, /* pagesize hardcoded to 4k */ + 0, rpage, rnum); + + if (i == ((pginfo->num_pages + 512 - 1) / 512) - 1) { + /* check for 'registration complete'==H_Success */ + /* and for 'page registered'==H_PAGE_REGISTERED */ + if (rc != 
H_Success) { + EDEB_ERR(4, "last hipz_reg_rpage_mr failed, " + "rc=%lx e_mr=%p i=%x hca_hndl=%lx " + "mr_hndl=%lx lkey=%x", rc, e_mr, i, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_rrpg_last(rc); + break; + } else + retcode = 0; + } else if (rc != H_PAGE_REGISTERED) { + EDEB_ERR(4, "hipz_reg_rpage_mr failed, rc=%lx e_mr=%p " + "i=%x lkey=%x hca_hndl=%lx mr_hndl=%lx", + rc, e_mr, i, e_mr->ib.ib_mr.lkey, + shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle); + retcode = ehca_mrmw_map_rc_rrpg_notlast(rc); + break; + } else + retcode = 0; + } /* end for(i) */ + + + ehca_reg_mr_rpages_exit1: + kfree(kpage); + ehca_reg_mr_rpages_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x", retcode); + else + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p pginfo=%p " + "num_pages=%lx", + retcode, shca, e_mr, pginfo, pginfo->num_pages); + return (retcode); +} /* end ehca_reg_mr_rpages() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + u32 acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_mr->pf; + u64 iova_start_out = 0; + u32 hipz_acl = 0; + u64 *kpage = 0; + u64 rpage = 0; + struct ehca_mr_pginfo pginfo_save; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + kpage = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (kpage == 0) { + EDEB_ERR(4, "kpage alloc failed"); + retcode = -ENOMEM; + goto ehca_rereg_mr_rereg1_exit0; + } + memset(kpage, 0, PAGE_SIZE); + + pginfo_save = *pginfo; + 
retcode = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_pages, kpage); + if (retcode != 0) { + EDEB_ERR(4, "set pagebuf failed, e_mr=%p pginfo=%p type=%x " + "num_pages=%lx kpage=%p", + e_mr, pginfo, pginfo->type, pginfo->num_pages, kpage); + goto ehca_rereg_mr_rereg1_exit1; + } + rpage = ehca_kv_to_g(kpage); + if (rpage == 0) { + EDEB_ERR(4, "kpage=%p", kpage); + retcode = -EFAULT; + goto ehca_rereg_mr_rereg1_exit1; + } + rc = hipz_h_reregister_pmr(shca->ipz_hca_handle, pfmr, &shca->pf, + &e_mr->ipz_mr_handle, (u64)iova_start, + size, hipz_acl, e_pd->fw_pd, rpage, + &iova_start_out, lkey, rkey); + if (rc != H_Success) { + /* reregistration unsuccessful, */ + /* try it again with the 3 hCalls, */ + /* e.g. this is required in case H_MR_CONDITION */ + /* (MW bound or MR is shared) */ + EDEB(6, "hipz_h_reregister_pmr failed (Rereg1), rc=%lx " + "e_mr=%p", rc, e_mr); + *pginfo = pginfo_save; + retcode = -EAGAIN; + } else if ((u64 *)iova_start_out != iova_start) { + EDEB_ERR(4, "PHYP changed iova_start in rereg_pmr, " + "iova_start=%p iova_start_out=%lx e_mr=%p " + "mr_handle=%lx lkey=%x", iova_start, iova_start_out, + e_mr, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey); + retcode = -EFAULT; + } else { + /* successful reregistration */ + /* note: start and start_out are identical for eServer HCAs */ + e_mr->num_pages = pginfo->num_pages; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; + } + + ehca_rereg_mr_rereg1_exit1: + kfree(kpage); + ehca_rereg_mr_rereg1_exit0: + if ((retcode == 0) || (retcode == -EAGAIN)) + EDEB_EX(7, "retcode=%x rc=%lx lkey=%x rkey=%x pginfo=%p " + "num_pages=%lx", + retcode, rc, *lkey, *rkey, pginfo, pginfo->num_pages); + else + EDEB_EX(4, "retcode=%x rc=%lx lkey=%x rkey=%x pginfo=%p " + "num_pages=%lx", + retcode, rc, *lkey, *rkey, pginfo, pginfo->num_pages); + return (retcode); +} /* end ehca_rereg_mr_rereg1() */ + +/*----------------------------------------------------------------------*/ 
+/*----------------------------------------------------------------------*/ + +int ehca_rereg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_mr->pf; + int Rereg1Hcall = TRUE; /* TRUE: use hipz_h_reregister_pmr directly */ + int Rereg3Hcall = FALSE; /* TRUE: use 3 hipz calls for reregistration */ + struct ehca_bridge_handle save_bridge; + + EDEB_EN(7, "shca=%p e_mr=%p iova_start=%p size=%lx acl=%x " + "e_pd=%p pginfo=%p num_pages=%lx", shca, e_mr, + iova_start, size, acl, e_pd, pginfo, pginfo->num_pages); + + /* first determine reregistration hCall(s) */ + if ((pginfo->num_pages > 512) || (e_mr->num_pages > 512) || + (pginfo->num_pages > e_mr->num_pages)) { + EDEB(7, "Rereg3 case, pginfo->num_pages=%lx " + "e_mr->num_pages=%x", pginfo->num_pages, e_mr->num_pages); + Rereg1Hcall = FALSE; + Rereg3Hcall = TRUE; + } + + if (e_mr->flags & EHCA_MR_FLAG_MAXMR) { /* check for max-MR */ + Rereg1Hcall = FALSE; + Rereg3Hcall = TRUE; + e_mr->flags &= ~EHCA_MR_FLAG_MAXMR; + EDEB(4, "Rereg MR for max-MR! 
e_mr=%p", e_mr); + } + + if (Rereg1Hcall) { + retcode = ehca_rereg_mr_rereg1(shca, e_mr, iova_start, size, + acl, e_pd, pginfo, lkey, rkey); + if (retcode != 0) { + if (retcode == -EAGAIN) + Rereg3Hcall = TRUE; + else + goto ehca_rereg_mr_exit0; + } + } + + if (Rereg3Hcall) { + struct ehca_mr save_mr; + + /* first deregister old MR */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_mr->ipz_mr_handle); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx e_mr=%p " + "hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + rc, e_mr, shca->ipz_hca_handle.handle, + e_mr->ipz_mr_handle.handle, + e_mr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto ehca_rereg_mr_exit0; + } + /* clean ehca_mr_t, without changing struct ib_mr and lock */ + save_bridge = pfmr->bridge; + save_mr = *e_mr; + ehca_mr_deletenew(e_mr); + + /* set some MR values */ + e_mr->flags = save_mr.flags; + pfmr->bridge = save_bridge; + e_mr->fmr_page_size = save_mr.fmr_page_size; + e_mr->fmr_max_pages = save_mr.fmr_max_pages; + e_mr->fmr_max_maps = save_mr.fmr_max_maps; + e_mr->fmr_map_cnt = save_mr.fmr_map_cnt; + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size, acl, + e_pd, pginfo, lkey, rkey); + if (retcode != 0) { + u32 offset = (u64)(&e_mr->flags) - (u64)e_mr; + memcpy(&e_mr->flags, &(save_mr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_rereg_mr_exit0; + } + } + + ehca_rereg_mr_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x " + "rkey=%x Rereg1Hcall=%x Rereg3Hcall=%x", + retcode, shca, e_mr, iova_start, size, acl, e_pd, + pginfo, pginfo->num_pages, *lkey, *rkey, Rereg1Hcall, + Rereg3Hcall); + else + EDEB_EX(4, "retcode=%x shca=%p e_mr=%p iova_start=%p size=%lx " + "acl=%x e_pd=%p pginfo=%p num_pages=%lx lkey=%x " + "rkey=%x Rereg1Hcall=%x Rereg3Hcall=%x", + retcode, shca, e_mr, iova_start, size, acl, e_pd, + pginfo, pginfo->num_pages, *lkey, *rkey, 
Rereg1Hcall, + Rereg3Hcall); + + return (retcode); +} /* end ehca_rereg_mr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_unmap_one_fmr(struct ehca_shca *shca, + struct ehca_mr *e_fmr) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_fmr->pf; + int Rereg1Hcall = TRUE; /* TRUE: use hipz_mr_reregister directly */ + int Rereg3Hcall = FALSE; /* TRUE: use 3 hipz calls for unmapping */ + struct ehca_bridge_handle save_bridge; + struct ehca_pd *e_pd = 0; + struct ehca_mr save_fmr; + u32 tmp_lkey = 0; + u32 tmp_rkey = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + + EDEB_EN(7, "shca=%p e_fmr=%p", shca, e_fmr); + + /* first check if reregistration hCall can be used for unmap */ + if (e_fmr->fmr_max_pages > 512) { + Rereg1Hcall = FALSE; + Rereg3Hcall = TRUE; + } + + e_pd = container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd); + + if (Rereg1Hcall) { + /* note: after using rereg hcall with len=0, */ + /* rereg hcall must be used again for registering pages */ + u64 start_out = 0; + rc = hipz_h_reregister_pmr(shca->ipz_hca_handle, pfmr, + &shca->pf, &e_fmr->ipz_mr_handle, 0, + 0, 0, e_pd->fw_pd, 0, &start_out, + &tmp_lkey, &tmp_rkey); + if (rc != H_Success) { + /* should not happen, because length checked above, */ + /* FMRs are not shared and no MW bound to FMRs */ + EDEB_ERR(4, "hipz_reregister_pmr failed (Rereg1), " + "rc=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "lkey=%x", rc, e_fmr, + shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey); + Rereg3Hcall = TRUE; + } else { + /* successful reregistration */ + e_fmr->start = 0; + e_fmr->size = 0; + } + } + + if (Rereg3Hcall) { + struct ehca_mr save_mr; + + /* first free old FMR */ + rc = hipz_h_free_resource_mr(shca->ipz_hca_handle, pfmr, + &e_fmr->ipz_mr_handle); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_free_mr failed, rc=%lx 
e_fmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", rc, e_fmr, + shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey); + retcode = ehca_mrmw_map_rc_free_mr(rc); + goto ehca_unmap_one_fmr_exit0; + } + /* clean ehca_mr_t, without changing lock */ + save_bridge = pfmr->bridge; + save_fmr = *e_fmr; + ehca_mr_deletenew(e_fmr); + + /* set some MR values */ + e_fmr->flags = save_fmr.flags; + pfmr->bridge = save_bridge; + e_fmr->fmr_page_size = save_fmr.fmr_page_size; + e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; + e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; + e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; + e_fmr->acl = save_fmr.acl; + + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_pages = 0; + retcode = ehca_reg_mr(shca, e_fmr, 0, + (e_fmr->fmr_max_pages * + e_fmr->fmr_page_size), + e_fmr->acl, e_pd, &pginfo, &tmp_lkey, + &tmp_rkey); + if (retcode != 0) { + /* restore from save_fmr, the copy taken above */ + u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; + memcpy(&e_fmr->flags, &(save_fmr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_unmap_one_fmr_exit0; + } + } + + ehca_unmap_one_fmr_exit0: + EDEB_EX(7, "retcode=%x tmp_lkey=%x tmp_rkey=%x fmr_max_pages=%x " + "Rereg1Hcall=%x Rereg3Hcall=%x", retcode, tmp_lkey, tmp_rkey, + e_fmr->fmr_max_pages, Rereg1Hcall, Rereg3Hcall); + return (retcode); +} /* end ehca_unmap_one_fmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_smr(struct ehca_shca *shca, + struct ehca_mr *e_origmr, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_newmr->pf; + u32 hipz_acl = 0; + + EDEB_EN(7,"shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + rc = 
hipz_h_register_smr(shca->ipz_hca_handle, pfmr, &e_origmr->pf, + &shca->pf, &e_origmr->ipz_mr_handle, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + &e_newmr->ipz_mr_handle, lkey, rkey); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_reg_smr failed, rc=%lx shca=%p e_origmr=%p " + "e_newmr=%p iova_start=%p acl=%x e_pd=%p hca_hndl=%lx " + "mr_hndl=%lx lkey=%x", rc, shca, e_origmr, e_newmr, + iova_start, acl, e_pd, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_reg_smr(rc); + goto ehca_reg_smr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + goto ehca_reg_smr_exit0; + + ehca_reg_smr_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", + retcode, *lkey, *rkey); + else + EDEB_EX(4, "retcode=%x shca=%p e_origmr=%p e_newmr=%p " + "iova_start=%p acl=%x e_pd=%p", retcode, + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + return (retcode); +} /* end ehca_reg_smr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_internal_maxmr( + struct ehca_shca *shca, + struct ehca_pd *e_pd, + struct ehca_mr **e_maxmr) +{ + int retcode = 0; + struct ehca_mr *e_mr = 0; + u64 *iova_start = 0; + u64 size_maxmr = 0; + struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,0,0,0,0,0}; + struct ib_phys_buf ib_pbuf; + u32 num_pages_mr = 0; + + EDEB_EN(7, "shca=%p e_pd=%p e_maxmr=%p", shca, e_pd, e_maxmr); + + if (ehca_adr_bad(shca) || ehca_adr_bad(e_pd) || ehca_adr_bad(e_maxmr)) { + EDEB_ERR(4, "bad input values: shca=%p e_pd=%p e_maxmr=%p", + shca, e_pd, e_maxmr); + retcode = -EINVAL; + goto ehca_reg_internal_maxmr_exit0; + } + + e_mr = ehca_mr_new(); + if (!e_mr) { + EDEB_ERR(4, "out of memory"); + retcode = -ENOMEM; + goto ehca_reg_internal_maxmr_exit0; + } + e_mr->flags 
|= EHCA_MR_FLAG_MAXMR; + + /* register internal max-MR on HCA */ + size_maxmr = (u64)high_memory - PAGE_OFFSET; + EDEB(9, "high_memory=%p PAGE_OFFSET=%lx", high_memory, PAGE_OFFSET); + iova_start = (u64 *)KERNELBASE; + ib_pbuf.addr = 0; + ib_pbuf.size = size_maxmr; + num_pages_mr = + ((((u64)iova_start % PAGE_SIZE) + size_maxmr + + PAGE_SIZE - 1) / PAGE_SIZE); + + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_pages = num_pages_mr; + pginfo.num_phys_buf = 1; + pginfo.phys_buf_array = &ib_pbuf; + + retcode = ehca_reg_mr(shca, e_mr, iova_start, size_maxmr, 0, e_pd, + &pginfo, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); + if (retcode != 0) { + EDEB_ERR(4, "reg of internal max MR failed, e_mr=%p " + "iova_start=%p size_maxmr=%lx num_pages_mr=%x", + e_mr, iova_start, size_maxmr, num_pages_mr); + goto ehca_reg_internal_maxmr_exit1; + } + + /* successful registration of all pages */ + e_mr->ib.ib_mr.device = e_pd->ib_pd.device; + e_mr->ib.ib_mr.pd = &e_pd->ib_pd; + e_mr->ib.ib_mr.uobject = NULL; + atomic_inc(&(e_pd->ib_pd.usecnt)); + atomic_set(&(e_mr->ib.ib_mr.usecnt), 0); + *e_maxmr = e_mr; + goto ehca_reg_internal_maxmr_exit0; + + ehca_reg_internal_maxmr_exit1: + ehca_mr_delete(e_mr); + ehca_reg_internal_maxmr_exit0: + if (retcode == 0) + EDEB_EX(7, "*e_maxmr=%p lkey=%x rkey=%x", + *e_maxmr, (*e_maxmr)->ib.ib_mr.lkey, + (*e_maxmr)->ib.ib_mr.rkey); + else + EDEB_EX(4, "retcode=%x shca=%p e_pd=%p e_maxmr=%p", + retcode, shca, e_pd, e_maxmr); + return (retcode); +} /* end ehca_reg_internal_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_reg_maxmr(struct ehca_shca *shca, + struct ehca_mr *e_newmr, + u64 *iova_start, + int acl, + struct ehca_pd *e_pd, + u32 *lkey, + u32 *rkey) +{ + int retcode = 0; + u64 rc = H_Success; + struct ehca_pfmr *pfmr = &e_newmr->pf; + struct ehca_mr *e_origmr = shca->maxmr; + u32 hipz_acl = 0; + + EDEB_EN(7,"shca=%p 
e_origmr=%p e_newmr=%p iova_start=%p acl=%x e_pd=%p", + shca, e_origmr, e_newmr, iova_start, acl, e_pd); + + ehca_mrmw_map_acl(acl, &hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + + rc = hipz_h_register_smr(shca->ipz_hca_handle, pfmr, &e_origmr->pf, + &shca->pf, &e_origmr->ipz_mr_handle, + (u64)iova_start, hipz_acl, e_pd->fw_pd, + &e_newmr->ipz_mr_handle, lkey, rkey); + if (rc != H_Success) { + EDEB_ERR(4, "hipz_reg_smr failed, rc=%lx e_origmr=%p " + "hca_hndl=%lx mr_hndl=%lx lkey=%x", + rc, e_origmr, shca->ipz_hca_handle.handle, + e_origmr->ipz_mr_handle.handle, + e_origmr->ib.ib_mr.lkey); + retcode = ehca_mrmw_map_rc_reg_smr(rc); + goto ehca_reg_maxmr_exit0; + } + /* successful registration */ + e_newmr->num_pages = e_origmr->num_pages; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; + + ehca_reg_maxmr_exit0: + EDEB_EX(7, "retcode=%x lkey=%x rkey=%x", retcode, *lkey, *rkey); + return (retcode); +} /* end ehca_reg_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +int ehca_dereg_internal_maxmr(struct ehca_shca *shca) +{ + int retcode = 0; + struct ehca_mr *e_maxmr = 0; + struct ib_pd *ib_pd = 0; + + EDEB_EN(7, "shca=%p shca->maxmr=%p", shca, shca->maxmr); + + if (shca->maxmr == 0) { + EDEB_ERR(4, "bad call, shca=%p", shca); + retcode = -EINVAL; + goto ehca_dereg_internal_maxmr_exit0; + } + + e_maxmr = shca->maxmr; + ib_pd = e_maxmr->ib.ib_mr.pd; + shca->maxmr = 0; /* remove internal max-MR indication from SHCA */ + + retcode = ehca_dereg_mr(&e_maxmr->ib.ib_mr); + if (retcode != 0) { + EDEB_ERR(3, "dereg internal max-MR failed, " + "retcode=%x e_maxmr=%p shca=%p lkey=%x", + retcode, e_maxmr, shca, e_maxmr->ib.ib_mr.lkey); + shca->maxmr = e_maxmr; + goto ehca_dereg_internal_maxmr_exit0; + } + + atomic_dec(&ib_pd->usecnt); + + ehca_dereg_internal_maxmr_exit0: + if (retcode == 0) + EDEB_EX(7, ""); + else 
+ EDEB_EX(4, "retcode=%x shca=%p shca->maxmr=%p", + retcode, shca, shca->maxmr); + return (retcode); +} /* end ehca_dereg_internal_maxmr() */ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h new file mode 100644 index 0000000..4df4b5b --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -0,0 +1,739 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * MR/MW declarations and inline functions + * + * Authors: Dietmar Decker + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_mrmw.h,v 1.59 2006/02/06 10:17:34 schickhj Exp $ + */ + +#ifndef _EHCA_MRMW_H_ +#define _EHCA_MRMW_H_ + +#undef DEB_PREFIX +#define DEB_PREFIX "mrmw" + +#include "hipz_structs.h" + + +int ehca_reg_mr(struct ehca_shca *shca, + struct ehca_mr *e_mr, + u64 *iova_start, + u64 size, + int acl, + struct ehca_pd *e_pd, + struct ehca_mr_pginfo *pginfo, + u32 *lkey, /**addr & ~PAGE_MASK)) { + EDEB_ERR(4, "iova_start/addr mismatch, iova_start=%p " + "pbuf->addr=%lx pbuf->size=%lx", + iova_start, pbuf->addr, pbuf->size); + return (-EINVAL); + } + if (((pbuf->addr + pbuf->size) % PAGE_SIZE) && + (num_phys_buf > 1)) { + EDEB_ERR(4, "addr/size mismatch in 1st buf, pbuf->addr=%lx " + "pbuf->size=%lx", pbuf->addr, pbuf->size); + return (-EINVAL); + } + + for (i = 0; i < num_phys_buf; i++) { + if ((i > 0) && (pbuf->addr % PAGE_SIZE)) { + EDEB_ERR(4, "bad address, i=%x pbuf->addr=%lx " + "pbuf->size=%lx", i, pbuf->addr, pbuf->size); + return (-EINVAL); + } + if (((i > 0) && /* not 1st */ + (i < (num_phys_buf - 1)) && /* not last */ + (pbuf->size % PAGE_SIZE)) || (pbuf->size == 0)) { + EDEB_ERR(4, "bad size, i=%x pbuf->size=%lx", + i, pbuf->size); + return (-EINVAL); + } + size_count += pbuf->size; + pbuf++; + } + + *size = size_count; + return (0); +} /* end ehca_mr_chk_buf_and_calc_size() */ + +/*----------------------------------------------------------------------*/ 
+/*----------------------------------------------------------------------*/ + +/** @brief check page list of map FMR verb for validness +*/ +static inline int ehca_fmr_check_page_list( + struct ehca_mr *e_fmr, /** e_fmr->fmr_max_pages)) { + EDEB_ERR(4, "bad list_len, list_len=%x e_fmr->fmr_max_pages=%x " + "fmr=%p", list_len, e_fmr->fmr_max_pages, e_fmr); + return (-EINVAL); + } + + /* each page must be aligned */ + page = page_list; + for (i = 0; i < list_len; i++) { + if (*page % PAGE_SIZE) { + EDEB_ERR(4, "bad page, i=%x *page=%lx page=%p " + "fmr=%p", i, *page, page, e_fmr); + return (-EINVAL); + } + page++; + } + + return (0); +} /* end ehca_fmr_check_page_list() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/** @brief setup page buffer from page info + */ +static inline int ehca_set_pagebuf(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) /**type, pginfo->num_pages, pginfo->next_buf, + pginfo->next_page, number, kpage, pginfo->page_count, + pginfo->next_listelem, pginfo->region, pginfo->next_chunk, + pginfo->next_nmap); + + if (pginfo->type == EHCA_MR_PGI_PHYS) { + /* loop over desired phys_buf_array entries */ + while (i < number) { + pbuf = pginfo->phys_buf_array + pginfo->next_buf; + numpg = ((pbuf->size + PAGE_SIZE - 1) / PAGE_SIZE); + while (pginfo->next_page < numpg) { + /* sanity check */ + if (pginfo->page_count >= pginfo->num_pages) { + EDEB_ERR(4, "page_count >= num_pages, " + "page_count=%lx num_pages=%lx " + "i=%x", pginfo->page_count, + pginfo->num_pages, i); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + *kpage = phys_to_abs((pbuf->addr & PAGE_MASK) + + (pginfo->next_page * + PAGE_SIZE)); + if ((*kpage == 0) && (pbuf->addr != 0)) { + EDEB_ERR(4, "pbuf->addr=%lx" + " pbuf->size=%lx" + " next_page=%lx", + pbuf->addr, pbuf->size, + pginfo->next_page); + retcode = -EFAULT; + goto 
ehca_set_pagebuf_exit0; + } + (pginfo->next_page)++; + (pginfo->page_count)++; + kpage++; + i++; + if (i >= number) break; + } + if (pginfo->next_page >= numpg) { + (pginfo->next_buf)++; + pginfo->next_page = 0; + } + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + /* loop over desired chunk entries */ + /* (@TODO: add support for large pages) */ + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + EDEB(9, "chunk->page_list[0]=%lx", + (u64)sg_dma_address(&chunk->page_list[0])); + for (i = pginfo->next_nmap; i < chunk->nmap; i++) { + pgaddr = ( page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ); + *kpage = phys_to_abs(pgaddr); + EDEB(9,"pgaddr=%lx *kpage=%lx", pgaddr, *kpage); + if (*kpage == 0) { + EDEB_ERR(4, "chunk->page_list[i]=%lx" + " i=%x mr=%p", + (u64)sg_dma_address( + &chunk->page_list[i]), + i, e_mr); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->page_count)++; + (pginfo->next_nmap)++; + kpage++; + j++; + if (j >= number) break; + } + if ( (pginfo->next_nmap >= chunk->nmap) && + (j >= number) ) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + /* loop over desired page_list entries */ + fmrlist = pginfo->page_list + pginfo->next_listelem; + for (i = 0; i < number; i++) { + *kpage = phys_to_abs(*fmrlist); + if (*kpage == 0) { + EDEB_ERR(4, "*fmrlist=%lx fmrlist=%p" + " next_listelem=%lx", *fmrlist, + fmrlist, pginfo->next_listelem); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + (pginfo->next_listelem)++; + (pginfo->page_count)++; + fmrlist++; + kpage++; + } + } else { + EDEB_ERR(4, "bad 
pginfo->type=%x", pginfo->type); + retcode = -EFAULT; + goto ehca_set_pagebuf_exit0; + } + + ehca_set_pagebuf_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "next_buf=%lx next_page=%lx number=%x kpage=%p " + "page_count=%lx i=%x next_listelem=%lx region=%p " + "next_chunk=%p next_nmap=%lx", + retcode, e_mr, pginfo, pginfo->type, pginfo->num_pages, + pginfo->next_buf, pginfo->next_page, number, kpage, + pginfo->page_count, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(4, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "next_buf=%lx next_page=%lx number=%x kpage=%p " + "page_count=%lx i=%x next_listelem=%lx region=%p " + "next_chunk=%p next_nmap=%lx", + retcode, e_mr, pginfo, pginfo->type, pginfo->num_pages, + pginfo->next_buf, pginfo->next_page, number, kpage, + pginfo->page_count, i, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + return (retcode); +} /* end ehca_set_pagebuf() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/** @brief setup 1 page from page info page buffer + */ +static inline int ehca_set_pagebuf_1(struct ehca_mr *e_mr, + struct ehca_mr_pginfo *pginfo, + u64 *rpage) /**type, pginfo->num_pages, pginfo->next_buf, + pginfo->next_page, rpage, pginfo->page_count, + pginfo->next_listelem, pginfo->region, pginfo->next_chunk, + pginfo->next_nmap); + + if (pginfo->type == EHCA_MR_PGI_PHYS) { + /* sanity check */ + if (pginfo->page_count >= pginfo->num_pages) { + EDEB_ERR(4, "page_count >= num_pages, " + "page_count=%lx num_pages=%lx", + pginfo->page_count, pginfo->num_pages); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf; + *rpage = phys_to_abs(((tmp_pbuf->addr & PAGE_MASK) + + (pginfo->next_page * PAGE_SIZE))); + if ((*rpage == 0) && 
(tmp_pbuf->addr != 0)) { + EDEB_ERR(4, "tmp_pbuf->addr=%lx" + " tmp_pbuf->size=%lx next_page=%lx", + tmp_pbuf->addr, tmp_pbuf->size, + pginfo->next_page); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->next_page)++; + (pginfo->page_count)++; + if (pginfo->next_page >= tmp_pbuf->size / PAGE_SIZE) { + (pginfo->next_buf)++; + pginfo->next_page = 0; + } + } else if (pginfo->type == EHCA_MR_PGI_USER) { + chunk = pginfo->next_chunk; + prev_chunk = pginfo->next_chunk; + list_for_each_entry_continue(chunk, + (&(pginfo->region->chunk_list)), + list) { + pgaddr = ( page_to_pfn(chunk->page_list[ + pginfo->next_nmap].page) + << PAGE_SHIFT ); + *rpage = phys_to_abs(pgaddr); + EDEB(9,"pgaddr=%lx *rpage=%lx", pgaddr, *rpage); + if (*rpage == 0) { + EDEB_ERR(4, "chunk->page_list[]=%lx next_nmap=%lx " + "mr=%p", (u64)sg_dma_address( + &chunk->page_list[ + pginfo->next_nmap]), + pginfo->next_nmap, e_mr); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->page_count)++; + (pginfo->next_nmap)++; + if (pginfo->next_nmap >= chunk->nmap) { + pginfo->next_nmap = 0; + prev_chunk = chunk; + } + break; + } + pginfo->next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->region->chunk_list)), + list); + } else if (pginfo->type == EHCA_MR_PGI_FMR) { + tmp_fmrlist = pginfo->page_list + pginfo->next_listelem; + *rpage = phys_to_abs(*tmp_fmrlist); + if (*rpage == 0) { + EDEB_ERR(4, "*tmp_fmrlist=%lx tmp_fmrlist=%p" + " next_listelem=%lx", *tmp_fmrlist, + tmp_fmrlist, pginfo->next_listelem); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + (pginfo->next_listelem)++; + (pginfo->page_count)++; + } else { + EDEB_ERR(4, "bad pginfo->type=%x", pginfo->type); + retcode = -EFAULT; + goto ehca_set_pagebuf_1_exit0; + } + + ehca_set_pagebuf_1_exit0: + if (retcode == 0) + EDEB_EX(7, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "next_buf=%lx next_page=%lx rpage=%p page_count=%lx " + "next_listelem=%lx region=%p next_chunk=%p " + 
"next_nmap=%lx", + retcode, e_mr, pginfo, pginfo->type, pginfo->num_pages, + pginfo->next_buf, pginfo->next_page, rpage, + pginfo->page_count, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + else + EDEB_EX(4, "retcode=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " + "next_buf=%lx next_page=%lx rpage=%p page_count=%lx " + "next_listelem=%lx region=%p next_chunk=%p " + "next_nmap=%lx", + retcode, e_mr, pginfo, pginfo->type, pginfo->num_pages, + pginfo->next_buf, pginfo->next_page, rpage, + pginfo->page_count, pginfo->next_listelem, + pginfo->region, pginfo->next_chunk, pginfo->next_nmap); + return (retcode); +} /* end ehca_set_pagebuf_1() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ + +/** @brief check MR if it is a max-MR, i.e. uses whole memory + in case it's a max-MR TRUE is returned, else FALSE +*/ +static inline int ehca_mr_is_maxmr(u64 size, + u64 *iova_start) +{ + /* a MR is treated as max-MR only if it fits following: */ + if ((size == ((u64)high_memory - PAGE_OFFSET)) && + (iova_start == (void*)KERNELBASE)) { + EDEB(6, "this is a max-MR"); + return (TRUE); + } else + return (FALSE); +} /* end ehca_mr_is_maxmr() */ + +/*----------------------------------------------------------------------*/ +/*----------------------------------------------------------------------*/ +/** @brief map access control for MR/MW. + This routine is used for MR and MW. 
+*/ +static inline void ehca_mrmw_map_acl(int ib_acl, /**flags) - (u64)mr; + memset(&mr->flags, 0, sizeof(*mr) - offset); +} /* end ehca_mr_deletenew() */ + +#endif /*_EHCA_MRMW_H_*/ From rolandd at cisco.com Fri Feb 17 16:57:48 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:48 -0800 Subject: [openib-general] [PATCH 16/22] ehca post send/receive and poll CQ In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005748.13620.45620.stgit@localhost.localdomain> From: Roland Dreier There are an awful lot of magic numbers scattered around. Probably they should become enums somewhere. The compatibility defines for using the kernel file in userspace shouldn't go into the kernel. --- drivers/infiniband/hw/ehca/ehca_reqs.c | 401 ++++++++++++++++++++++++++ drivers/infiniband/hw/ehca/ehca_reqs_core.c | 420 +++++++++++++++++++++++++++ 2 files changed, 821 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c new file mode 100644 index 0000000..659e6ba --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -0,0 +1,401 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * post_send/recv, poll_cq, req_notify + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. 
+ * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_reqs.c,v 1.41 2006/02/06 10:17:34 schickhj Exp $ + */ + + +#define DEB_PREFIX "reqs" + +#include "ehca_kernel.h" +#include "ehca_classes.h" +#include "ehca_tools.h" +#include "hcp_if.h" +#include "ehca_qes.h" +#include "ehca_iverbs.h" + +/* include some inline service routines */ +#include "ehca_asm.h" +#include "ehca_reqs_core.c" + +int ehca_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_send_wr *cur_send_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(send_wr); + EDEB_EN(7, "ehca_qp=%p qp_num=%x send_wr=%p bad_send_wr=%p", + my_qp, qp->qp_num, send_wr, bad_send_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + + /* loop processes list of send reqs */ + for (cur_send_wr = send_wr; cur_send_wr != NULL; + cur_send_wr = cur_send_wr->next) { + void *start_addr = + &my_qp->ehca_qp_core.ipz_squeue.current_q_addr; + /* get pointer next to free WQE */ + wqe_p = ipz_QEit_get_inc(&my_qp->ehca_qp_core.ipz_squeue); + if (unlikely(wqe_p == NULL)) { + /* too many posted work requests: queue overflow */ + if (bad_send_wr != NULL) { + *bad_send_wr = cur_send_wr; + } + if (wqe_cnt==0) { + retcode = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + /* write a SEND WQE into the QUEUE */ + retcode = ehca_write_swqe(&my_qp->ehca_qp_core, + wqe_p, cur_send_wr); + /* if something failed, + reset the free entry pointer to the start value + */ + if (unlikely(retcode != 0)) { + my_qp->ehca_qp_core.ipz_squeue.current_q_addr = + start_addr; + *bad_send_wr = cur_send_wr; + if (wqe_cnt==0) { + retcode = -EINVAL; + EDEB_ERR(4, "Could not write WQE qp_num=%x", + qp->qp_num); + } + goto post_send_exit0; + } + wqe_cnt++; + 
EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_send_wr */ + + post_send_exit0: + /* UNLOCK the QUEUE */ + spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_SQA(&my_qp->ehca_qp_core, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, retcode, wqe_cnt); + return retcode; +} + +int ehca_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + struct ehca_qp *my_qp = NULL; + struct ib_recv_wr *cur_recv_wr = NULL; + struct ehca_wqe *wqe_p = NULL; + int wqe_cnt = 0; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_ADR(qp); + my_qp = container_of(qp, struct ehca_qp, ib_qp); + EHCA_CHECK_QP(my_qp); + EHCA_CHECK_ADR(recv_wr); + EDEB_EN(7, "ehca_qp=%p qp_num=%x recv_wr=%p bad_recv_wr=%p", + my_qp, qp->qp_num, recv_wr, bad_recv_wr); + + /* LOCK the QUEUE */ + spin_lock_irqsave(&my_qp->spinlock_r, spl_flags); + + /* loop processes list of receive reqs */ + for (cur_recv_wr = recv_wr; cur_recv_wr != NULL; + cur_recv_wr = cur_recv_wr->next) { + void *start_addr = + &my_qp->ehca_qp_core.ipz_rqueue.current_q_addr; + /* get pointer to next free WQE */ + wqe_p = ipz_QEit_get_inc(&my_qp->ehca_qp_core.ipz_rqueue); + if (unlikely(wqe_p == NULL)) { + /* too many posted work requests: queue overflow */ + if (bad_recv_wr != NULL) { + *bad_recv_wr = cur_recv_wr; + } + if (wqe_cnt==0) { + retcode = -ENOMEM; + EDEB_ERR(4, "Too many posted WQEs qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + /* write a RECV WQE into the QUEUE */ + retcode = + ehca_write_rwqe(&my_qp->ehca_qp_core, wqe_p, cur_recv_wr); + /* if something failed, + reset the free entry pointer to the start value + */ + if (unlikely(retcode != 0)) { + my_qp->ehca_qp_core.ipz_rqueue.current_q_addr = + start_addr; + *bad_recv_wr = cur_recv_wr; + if (wqe_cnt==0) { + retcode = -EINVAL; + EDEB_ERR(4, "Could not write WQE 
qp_num=%x", + qp->qp_num); + } + goto post_recv_exit0; + } + wqe_cnt++; + EDEB(7, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, qp->qp_num, wqe_cnt); + } /* eof for cur_recv_wr */ + + post_recv_exit0: + spin_unlock_irqrestore(&my_qp->spinlock_r, spl_flags); + iosync(); /* serialize GAL register access */ + hipz_update_RQA(&my_qp->ehca_qp_core, wqe_cnt); + EDEB_EX(7, "ehca_qp=%p qp_num=%x ret=%x wqe_cnt=%d", + my_qp, qp->qp_num, retcode, wqe_cnt); + return retcode; +} + +/** + * Table converts ehca wc opcode to ib + * Since we use zero to indicate invalid opcode, the actual ib opcode must + * be decremented!!! + */ +static const u8 ib_wc_opcode[255] = { + [0x01] = IB_WC_RECV+1, + [0x02] = IB_WC_RECV_RDMA_WITH_IMM+1, + [0x04] = IB_WC_BIND_MW+1, + [0x08] = IB_WC_FETCH_ADD+1, + [0x10] = IB_WC_COMP_SWAP+1, + [0x20] = IB_WC_RDMA_WRITE+1, + [0x40] = IB_WC_RDMA_READ+1, + [0x80] = IB_WC_SEND+1 +}; + +/** @brief internal function to poll one entry of cq + */ +static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc) +{ + int retcode = 0; + struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); + struct ehca_cqe *cqe = NULL; + int cqe_count = 0; + + EDEB_EN(7, "ehca_cq=%p cq_num=%x wc=%p", my_cq, my_cq->cq_number, wc); + + poll_cq_one_read_cqe: + cqe = (struct ehca_cqe *) + ipz_QEit_get_inc_valid(&my_cq->ehca_cq_core.ipz_queue); + if (cqe == NULL) { + retcode = -EAGAIN; + EDEB(7, "Completion queue is empty ehca_cq=%p cq_num=%x " + "retcode=%x", my_cq, my_cq->cq_number, retcode); + goto poll_cq_one_exit0; + } + cqe_count++; + if (unlikely(cqe->status & 0x10)) { /* purge bit set */ + struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number); + int purgeflag = 0; + unsigned long spl_flags = 0; + if (qp==NULL) { /* should not happen */ + EDEB_ERR(4, "cq_num=%x qp_num=%x " + "could not find qp -> ignore cqe", + my_cq->cq_number, cqe->local_qp_number); + EDEB_DMP(4, cqe, 64, "cq_num=%x qp_num=%x", + my_cq->cq_number, cqe->local_qp_number); + /* ignore this 
purged cqe */ + goto poll_cq_one_read_cqe; + } + spin_lock_irqsave(&qp->spinlock_s, spl_flags); + purgeflag = qp->sqerr_purgeflag; + spin_unlock_irqrestore(&qp->spinlock_s, spl_flags); + if (purgeflag!=0) { + EDEB(6, "Got CQE with purged bit qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + EDEB_DMP(6, cqe, 64, "qp_num=%x src_qp=%x", + cqe->local_qp_number, cqe->remote_qp_number); + /* ignore this to avoid double cqes of bad wqe + that caused sqe and turn off purge flag */ + qp->sqerr_purgeflag = 0; + goto poll_cq_one_read_cqe; + } + } + + /* tracing cqe */ + if (IS_EDEB_ON(7)) { + EDEB(7, "Received COMPLETION ehca_cq=%p cq_num=%x -----", + my_cq, my_cq->cq_number); + EDEB_DMP(7, cqe, 64, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + EDEB(7, "ehca_cq=%p cq_num=%x -------------------------", + my_cq, my_cq->cq_number); + } + + /* we got a completion! */ + wc->wr_id = cqe->work_request_id; + + /* eval ib_wc_opcode */ + wc->opcode = ib_wc_opcode[cqe->optype]-1; + if (unlikely(wc->opcode == -1)) { + EDEB_ERR(4, "Invalid cqe->OPType=%x cqe->status=%x " + "ehca_cq=%p cq_num=%x", + cqe->optype, cqe->status, my_cq, my_cq->cq_number); + /* dump cqe for other infos */ + EDEB_DMP(4, cqe, 64, "ehca_cq=%p cq_num=%x", my_cq, my_cq->cq_number); + /* update also queue adder to throw away this entry!!! 
*/ + goto poll_cq_one_exit0; + } + /* eval ib_wc_status */ + if (unlikely(cqe->status & 0x80000000)) { /* complete with errors */ + map_ib_wc_status(cqe->status, &wc->status); + wc->vendor_err = wc->status; + } else { + wc->status = IB_WC_SUCCESS; + } + + wc->qp_num = cqe->local_qp_number; + wc->byte_len = ntohl(cqe->nr_bytes_transferred); + wc->pkey_index = cqe->pkey_index; + wc->slid = cqe->rlid; + wc->dlid_path_bits = cqe->dlid; + wc->src_qp = cqe->remote_qp_number; + wc->wc_flags = cqe->w_completion_flags; + wc->imm_data = cqe->immediate_data; + wc->sl = cqe->service_level; + + if (wc->status != IB_WC_SUCCESS) { + EDEB(6, "ehca_cq=%p cq_num=%x WARNING unsuccessful cqe " + "OPType=%x status=%x qp_num=%x src_qp=%x wr_id=%lx cqe=%p", + my_cq, my_cq->cq_number, cqe->optype, cqe->status, + cqe->local_qp_number, cqe->remote_qp_number, + cqe->work_request_id, cqe); + } + + poll_cq_one_exit0: + if (cqe_count>0) { + hipz_update_FECA(&my_cq->ehca_cq_core, cqe_count); + } + + EDEB_EX(7, "retcode=%x ehca_cq=%p cq_number=%x wc=%p " + "status=%x opcode=%x qp_num=%x byte_len=%x", + retcode, my_cq, my_cq->cq_number, wc, wc->status, + wc->opcode, wc->qp_num, wc->byte_len); + return (retcode); +} + +int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc) +{ + struct ehca_cq *my_cq = NULL; + int nr = 0; + struct ib_wc *current_wc = NULL; + int retcode = 0; + unsigned long spl_flags = 0; + + EHCA_CHECK_CQ(cq); + EHCA_CHECK_ADR(wc); + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + + EDEB_EN(7, "ehca_cq=%p cq_num=%x num_entries=%d wc=%p", + my_cq, my_cq->cq_number, num_entries, wc); + + if (num_entries < 1) { + EDEB_ERR(4, "Invalid num_entries=%d ehca_cq=%p cq_num=%x", + num_entries, my_cq, my_cq->cq_number); + retcode = -EINVAL; + goto poll_cq_exit0; + } + + current_wc = wc; + spin_lock_irqsave(&my_cq->spinlock, spl_flags); + for (nr = 0; nr < num_entries; nr++) { + retcode = ehca_poll_cq_one(cq, current_wc); + if (0 != retcode) { + 
break; + } + current_wc++; + } /* eof for nr */ + spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + if (-EAGAIN == retcode || 0 == retcode) { + retcode = nr; + } + + poll_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x wc=%p nr_entries=%d", + my_cq, my_cq->cq_number, retcode, wc, nr); + return (retcode); +} + +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +{ + struct ehca_cq *my_cq = NULL; + int retcode = 0; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EHCA_CHECK_CQ(my_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x cq_notif=%x", + my_cq, my_cq->cq_number, cq_notify); + + switch (cq_notify) { + case IB_CQ_SOLICITED: + hipz_set_CQx_N0(&my_cq->ehca_cq_core, 1); + break; + case IB_CQ_NEXT_COMP: + hipz_set_CQx_N1(&my_cq->ehca_cq_core, 1); + break; + default: + retcode = -EINVAL; + } + + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x", + my_cq, my_cq->cq_number, retcode); + + return (retcode); +} + +/* eof ehca_reqs.c */ diff --git a/drivers/infiniband/hw/ehca/ehca_reqs_core.c b/drivers/infiniband/hw/ehca/ehca_reqs_core.c new file mode 100644 index 0000000..c0b7281 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_reqs_core.c @@ -0,0 +1,420 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * post_send/recv, poll_cq, req_notify + * Common code to be included statically in respective user/kernel + * modules, i.e. ehca_ureqs.c/ehca_reqs.c + * This module contains C code only. Including modules must include + * all required header files. + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Hoang-Nam Nguyen + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_reqs_core.c,v 1.40 2006/02/06 10:17:34 schickhj Exp $ + */ + +/** THIS following block of defines + * replaces ib types of kernel space to corresponding ones in user space, + * so that the implemented inline functions below can be compiled and + * work in both user and kernel space. + * However this ASSUMES that there is no functional differences between ib + * types in kernel e.g. ib_send_wr and user space e.g. ibv_send_wr. 
+ */ + +#ifndef __KERNEL__ +#define ib_recv_wr ibv_recv_wr +#define ib_send_wr ibv_send_wr +#define ehca_av ehcau_av +/* ib_wr_opcode */ +#define IB_WR_SEND IBV_WR_SEND +#define IB_WR_SEND_WITH_IMM IBV_WR_SEND_WITH_IMM +#define IB_WR_RDMA_WRITE IBV_WR_RDMA_WRITE +#define IB_WR_RDMA_WRITE_WITH_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define IB_WR_RDMA_READ IBV_WR_RDMA_READ +/* ib_qp_type */ +#define IB_QPT_RC IBV_QPT_RC +#define IB_QPT_UC IBV_QPT_UC +#define IB_QPT_UD IBV_QPT_UD +/* ib_wc_opcode */ +#define ib_wc_opcode ibv_wc_opcode +#define IB_WC_SEND IBV_WC_SEND +#define IB_WC_RDMA_WRITE IBV_WC_RDMA_WRITE +#define IB_WC_RDMA_READ IBV_WC_RDMA_READ +#define IB_WC_COMP_SWAP IBV_WC_COMP_SWAP +#define IB_WC_FETCH_ADD IBV_WC_FETCH_ADD +#define IB_WC_BIND_MW IBV_WC_BIND_MW +#define IB_WC_RECV IBV_WC_RECV +#define IB_WC_RECV_RDMA_WITH_IMM IBV_WC_RECV_RDMA_WITH_IMM +/* ib_wc_status */ +#define ib_wc_status ibv_wc_status +#define IB_WC_LOC_LEN_ERR IBV_WC_LOC_LEN_ERR +#define IB_WC_LOC_QP_OP_ERR IBV_WC_LOC_QP_OP_ERR +#define IB_WC_LOC_EEC_OP_ERR IBV_WC_LOC_EEC_OP_ERR +#define IB_WC_LOC_PROT_ERR IBV_WC_LOC_PROT_ERR +#define IB_WC_WR_FLUSH_ERR IBV_WC_WR_FLUSH_ERR +#define IB_WC_MW_BIND_ERR IBV_WC_MW_BIND_ERR +#define IB_WC_GENERAL_ERR IBV_WC_GENERAL_ERR +#define IB_WC_REM_INV_REQ_ERR IBV_WC_REM_INV_REQ_ERR +#define IB_WC_REM_ACCESS_ERR IBV_WC_REM_ACCESS_ERR +#define IB_WC_REM_OP_ERR IBV_WC_REM_OP_ERR +#define IB_WC_REM_INV_RD_REQ_ERR IBV_WC_REM_INV_RD_REQ_ERR +#define IB_WC_RETRY_EXC_ERR IBV_WC_RETRY_EXC_ERR +#define IB_WC_RNR_RETRY_EXC_ERR IBV_WC_RNR_RETRY_EXC_ERR +#define IB_WC_REM_ABORT_ERR IBV_WC_REM_ABORT_ERR +#define IB_WC_INV_EECN_ERR IBV_WC_INV_EECN_ERR +#define IB_WC_INV_EEC_STATE_ERR IBV_WC_INV_EEC_STATE_ERR +#define IB_WC_BAD_RESP_ERR IBV_WC_BAD_RESP_ERR +#define IB_WC_FATAL_ERR IBV_WC_FATAL_ERR +#define IB_WC_SUCCESS IBV_WC_SUCCESS +/* ib_send_flags */ +#define IB_SEND_FENCE IBV_SEND_FENCE +#define IB_SEND_SIGNALED IBV_SEND_SIGNALED +#define IB_SEND_SOLICITED 
IBV_SEND_SOLICITED +#define IB_SEND_INLINE IBV_SEND_INLINE +#endif + +static inline int ehca_write_rwqe(struct ehca_qp_core *qp_core, + struct ehca_wqe *wqe_p, + struct ib_recv_wr *recv_wr) +{ + u8 cnt_ds; + if (unlikely((recv_wr->num_sge < 0) || + (recv_wr->num_sge > qp_core->ipz_rqueue.act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. " + "num_sqe=%x max_nr_of_sg=%x", + recv_wr->num_sge, qp_core->ipz_rqueue.act_nr_of_sg); + return (-EINVAL); /* invalid SG list length */ + } + + clear_cacheline(wqe_p); + clear_cacheline((u8 *) wqe_p + 32); + clear_cacheline((u8 *) wqe_p + 64); + + wqe_p->work_request_id = be64_to_cpu(recv_wr->wr_id); + wqe_p->nr_of_data_seg = recv_wr->num_sge; + + for (cnt_ds = 0; cnt_ds < recv_wr->num_sge; cnt_ds++) { + wqe_p->u.all_rcv.sg_list[cnt_ds].vaddr = + be64_to_cpu(recv_wr->sg_list[cnt_ds].addr); + wqe_p->u.all_rcv.sg_list[cnt_ds].lkey = + ntohl(recv_wr->sg_list[cnt_ds].lkey); + wqe_p->u.all_rcv.sg_list[cnt_ds].length = + ntohl(recv_wr->sg_list[cnt_ds].length); + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "RECEIVE WQE written into queue qp_core=%p", qp_core); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), + "qp_core=%p", qp_core); + } + + return (0); +} + +/* internal use only + uncomment this line to enable trace output of GSI send wr */ +/* #define DEBUG_GSI_SEND_WR 1 */ +#if defined(__KERNEL__) && defined(DEBUG_GSI_SEND_WR) + +/* need ib_mad struct */ +#include + +static void trace_send_wr_ud(const struct ib_send_wr *send_wr) +{ + int idx = 0; + int j = 0; + while (send_wr != NULL) { + struct ib_mad_hdr *mad_hdr = send_wr->wr.ud.mad_hdr; + struct ib_sge *sge = send_wr->sg_list; + EDEB(4, "send_wr#%x wr_id=%lx num_sge=%x " + "send_flags=%x opcode=%x",idx, send_wr->wr_id, + send_wr->num_sge, send_wr->send_flags, send_wr->opcode); + if (mad_hdr != NULL) { + EDEB(4, "send_wr#%x mad_hdr base_version=%x " + "mgmt_class=%x class_version=%x method=%x " + "status=%x class_specific=%x tid=%lx attr_id=%x " + "resv=%x attr_mod=%x", + 
idx, mad_hdr->base_version, mad_hdr->mgmt_class, + mad_hdr->class_version, mad_hdr->method, + mad_hdr->status, mad_hdr->class_specific, + mad_hdr->tid, mad_hdr->attr_id, mad_hdr->resv, + mad_hdr->attr_mod); + } + for (j = 0; j < send_wr->num_sge; j++) { +#ifdef EHCA_USERDRIVER + u8 *data = (u8 *) sge->addr; +#else + u8 *data = (u8 *) abs_to_virt(sge->addr); +#endif + EDEB(4, "send_wr#%x sge#%x addr=%p length=%x lkey=%x", + idx, j, data, sge->length, sge->lkey); + /* assume length is n*16 */ + EDEB_DMP(4, data, sge->length, "send_wr#%x sge#%x", idx, j); + sge++; + } /* eof for j */ + idx++; + send_wr = send_wr->next; + } /* eof while send_wr */ +} + +#endif /* __KERNEL__ && DEBUG_GSI_SEND_WR */ + +static inline int ehca_write_swqe(struct ehca_qp_core *qp_core, + struct ehca_wqe *wqe_p, + const struct ib_send_wr *send_wr) +{ + u32 idx; + u64 dma_length; + struct ehca_av *my_av; + u32 remote_qkey = send_wr->wr.ud.remote_qkey; + + clear_cacheline(wqe_p); + clear_cacheline((u8 *) wqe_p + 32); + + if (unlikely((send_wr->num_sge < 0) || + (send_wr->num_sge > qp_core->ipz_squeue.act_nr_of_sg))) { + EDEB_ERR(4, "Invalid number of WQE SGE. 
" + "num_sqe=%x max_nr_of_sg=%x", + send_wr->num_sge, qp_core->ipz_rqueue.act_nr_of_sg); + return (-EINVAL); /* invalid SG list length */ + } + + wqe_p->work_request_id = be64_to_cpu(send_wr->wr_id); + + switch (send_wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_SEND; + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + wqe_p->optype = WQE_OPTYPE_RDMAWRITE; + break; + case IB_WR_RDMA_READ: + wqe_p->optype = WQE_OPTYPE_RDMAREAD; + break; + default: + EDEB_ERR(4, "Invalid opcode=%x", send_wr->opcode); + return (-EINVAL); /* invalid opcode */ + } + + wqe_p->wqef = (send_wr->opcode) & 0xF0; + + wqe_p->wr_flag = 0; + if (send_wr->send_flags & IB_SEND_SIGNALED) { + wqe_p->wr_flag |= WQE_WRFLAG_REQ_SIGNAL_COM; + } + + if (send_wr->opcode == IB_WR_SEND_WITH_IMM || + send_wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + /* this might not work as long as HW does not support it */ + wqe_p->immediate_data = send_wr->imm_data; + wqe_p->wr_flag |= WQE_WRFLAG_IMM_DATA_PRESENT; + } + + wqe_p->nr_of_data_seg = send_wr->num_sge; + + switch (qp_core->qp_type) { +#ifdef __KERNEL__ + case IB_QPT_SMI: + case IB_QPT_GSI: +#endif /* __KERNEL__ */ + /* no break is intential here */ + case IB_QPT_UD: + /* IB 1.2 spec C10-15 compliance */ + if (send_wr->wr.ud.remote_qkey & 0x80000000) { + remote_qkey = qp_core->qkey; + } + wqe_p->destination_qp_number = + ntohl(send_wr->wr.ud.remote_qpn << 8); + wqe_p->local_ee_context_qkey = ntohl(remote_qkey); + if (send_wr->wr.ud.ah==NULL) { + EDEB_ERR(4, "wr.ud.ah is NULL. 
qp_core=%p", qp_core); + return (-EINVAL); + } + my_av = container_of(send_wr->wr.ud.ah, struct ehca_av, ib_ah); + wqe_p->u.ud_av.ud_av = my_av->av; + + /* omitted check of IB_SEND_INLINE + since HW does not support it */ + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.ud_av.sg_list[idx].vaddr = + be64_to_cpu(send_wr->sg_list[idx].addr); + wqe_p->u.ud_av.sg_list[idx].lkey = + ntohl(send_wr->sg_list[idx].lkey); + wqe_p->u.ud_av.sg_list[idx].length = + ntohl(send_wr->sg_list[idx].length); + } /* eof for idx */ +#ifdef __KERNEL__ + if (qp_core->qp_type == IB_QPT_SMI || + qp_core->qp_type == IB_QPT_GSI) { + wqe_p->u.ud_av.ud_av.pmtu = 1; + } + if (qp_core->qp_type == IB_QPT_GSI) { + wqe_p->pkeyi = + ntohs(send_wr->wr.ud.pkey_index); +#ifdef DEBUG_GSI_SEND_WR + trace_send_wr_ud(send_wr); +#endif /* DEBUG_GSI_SEND_WR */ + } +#endif /* __KERNEL__ */ + break; + + case IB_QPT_UC: + if (send_wr->send_flags & IB_SEND_FENCE) { + wqe_p->wr_flag |= WQE_WRFLAG_FENCE; + } + /* no break is intentional here */ + case IB_QPT_RC: + /*@@TODO atomic???*/ + wqe_p->u.nud.remote_virtual_adress = + be64_to_cpu(send_wr->wr.rdma.remote_addr); + wqe_p->u.nud.rkey = ntohl(send_wr->wr.rdma.rkey); + + /* omitted checking of IB_SEND_INLINE + since HW does not support it */ + dma_length = 0; + for (idx = 0; idx < send_wr->num_sge; idx++) { + wqe_p->u.nud.sg_list[idx].vaddr = + be64_to_cpu(send_wr->sg_list[idx].addr); + wqe_p->u.nud.sg_list[idx].lkey = + ntohl(send_wr->sg_list[idx].lkey); + wqe_p->u.nud.sg_list[idx].length = + ntohl(send_wr->sg_list[idx].length); + dma_length += send_wr->sg_list[idx].length; + } /* eof idx */ + wqe_p->u.nud.atomic_1st_op_dma_len = be64_to_cpu(dma_length); + + break; + + default: + EDEB_ERR(4, "Invalid qptype=%x", qp_core->qp_type); + return (-EINVAL); + } + + if (IS_EDEB_ON(7)) { + EDEB(7, "SEND WQE written into queue qp_core=%p ", qp_core); + EDEB_DMP(7, wqe_p, 16*(6 + wqe_p->nr_of_data_seg), + "qp_core=%p", qp_core); + } + return (0); +} + +/** @brief 
convert cqe_status to ib_wc_status + */ +static inline void map_ib_wc_status(u32 cqe_status, + enum ib_wc_status *wc_status) +{ + if (unlikely(cqe_status & 0x80000000)) { /* complete with errors */ + switch (cqe_status & 0x0000003F) { + case 0x01: + case 0x21: + *wc_status = IB_WC_LOC_LEN_ERR; + break; + case 0x02: + case 0x22: + *wc_status = IB_WC_LOC_QP_OP_ERR; + break; + case 0x03: + case 0x23: + *wc_status = IB_WC_LOC_EEC_OP_ERR; + break; + case 0x04: + case 0x24: + *wc_status = IB_WC_LOC_PROT_ERR; + break; + case 0x05: + case 0x25: + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + case 0x06: + *wc_status = IB_WC_MW_BIND_ERR; + break; + case 0x07: /* remote error - look into bits 20:24 */ + switch ((cqe_status & 0x0000F800) >> 11) { + case 0x0: + /* PSN Sequence Error! + couldn't find a matching VAPI status! */ + *wc_status = IB_WC_GENERAL_ERR; + break; + case 0x1: + *wc_status = IB_WC_REM_INV_REQ_ERR; + break; + case 0x2: + *wc_status = IB_WC_REM_ACCESS_ERR; + break; + case 0x3: + *wc_status = IB_WC_REM_OP_ERR; + break; + case 0x4: + *wc_status = IB_WC_REM_INV_RD_REQ_ERR; + break; + } + break; + case 0x08: + *wc_status = IB_WC_RETRY_EXC_ERR; + break; + case 0x09: + *wc_status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case 0x0A: + case 0x2D: + *wc_status = IB_WC_REM_ABORT_ERR; + break; + case 0x0B: + case 0x2E: + *wc_status = IB_WC_INV_EECN_ERR; + break; + case 0x0C: + case 0x2F: + *wc_status = IB_WC_INV_EEC_STATE_ERR; + break; + case 0x0D: + *wc_status = IB_WC_BAD_RESP_ERR; + break; + case 0x10: + /* WQE purged */ + *wc_status = IB_WC_WR_FLUSH_ERR; + break; + default: + *wc_status = IB_WC_FATAL_ERR; + + } + } else { + *wc_status = IB_WC_SUCCESS; + } +} + From rolandd at cisco.com Fri Feb 17 16:57:59 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:59 -0800 Subject: [openib-general] [PATCH 21/22] ehca main file In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: 
<20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005759.13620.10968.stgit@localhost.localdomain> From: Roland Dreier What is ehca_show_flightrecorder() trying to do that snprintf() is not fast enough? If you need to pass a binary structure back to userspace (with a kernel address in it??) then sysfs is not the right place to put it. Look at debugfs; or relayfs might make the most sense for your flightrecorder stuff. --- drivers/infiniband/hw/ehca/ehca_main.c | 1032 ++++++++++++++++++++++++++++++++ 1 files changed, 1032 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c new file mode 100644 index 0000000..2e2be06 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -0,0 +1,1032 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * module start stop, hca detection + * + * Authors: Heiko J Schick + * Christoph Raisch + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. + * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * $Id: ehca_main.c,v 1.137 2006/02/06 16:20:38 schickhj Exp $ + */ + +#define DEB_PREFIX "shca" + +#include "ehca_kernel.h" +#include "ehca_tools.h" +#include "ehca_classes.h" +#include "ehca_iverbs.h" +#include "ehca_eq.h" +#include "ehca_mrmw.h" + +#include "hcp_sense.h" /* TODO: later via hipz_* header file */ +#include "hcp_if.h" /* TODO: later via hipz_* header file */ + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Christoph Raisch "); +MODULE_DESCRIPTION("IBM eServer HCA Driver"); +MODULE_VERSION("EHCA2_0047"); + +#ifdef EHCA_USERDRIVER +int ehca_open_aqp1 = 1; +#else +int ehca_open_aqp1 = 0; +#endif +int ehca_tracelevel = -1; +int ehca_hw_level = 0; +int ehca_nr_ports = 2; +int ehca_use_hp_mr = 0; +int ehca_port_act_time = 30; +int ehca_poll_all_eqs = 1; +int ehca_static_rate = -1; + +module_param_named(open_aqp1, ehca_open_aqp1, int, 0); +module_param_named(tracelevel, ehca_tracelevel, int, 0); +module_param_named(hw_level, ehca_hw_level, int, 0); +module_param_named(nr_ports, ehca_nr_ports, int, 0); +module_param_named(use_hp_mr, ehca_use_hp_mr, int, 0); +module_param_named(port_act_time, ehca_port_act_time, int, 0); +module_param_named(poll_all_eqs, ehca_poll_all_eqs, int, 0); +module_param_named(static_rate, ehca_static_rate, int, 0); + +MODULE_PARM_DESC(open_aqp1, "0 no define AQP1 on startup (default)," + "1 define AQP1 on startup"); +MODULE_PARM_DESC(tracelevel, "0 maximum performance (no messages)," + "9 
maximum messages (no performance)"); +MODULE_PARM_DESC(hw_level, "0 autosensing," + "1 v. 0.20," + "2 v. 0.21"); +MODULE_PARM_DESC(nr_ports, "number of connected ports (default: 2)"); +MODULE_PARM_DESC(use_hp_mr, "use high performance MRs," + "0 no (default)," + "1 yes"); +MODULE_PARM_DESC(port_act_time, "time to wait for port activation" + "(default: 30 sec.)"); +MODULE_PARM_DESC(poll_all_eqs, "polls all event queues periodically" + "0 no," + "1 yes (default)"); +MODULE_PARM_DESC(static_rate, "set permanent static rate (default: disabled)"); + +/* This external trace mask controls what will end up in the + * kernel ring buffer. Number 6 means, that everything between + * 0 and 5 will be stored. + */ +u8 ehca_edeb_mask[EHCA_EDEB_TRACE_MASK_SIZE]={6,6,6,6, + 6,6,6,6, + 6,6,6,6, + 6,6,6,6, + 6,6,6,6, + 6,6,6,6, + 6,6,6,6, + 6,6,1,0}; + /* offset 0x1e is flightrecorder */ +EXPORT_SYMBOL(ehca_edeb_mask); + +atomic_t ehca_flightrecorder_index = ATOMIC_INIT(1); +unsigned long ehca_flightrecorder[EHCA_FLIGHTRECORDER_SIZE]; +EXPORT_SYMBOL(ehca_flightrecorder_index); +EXPORT_SYMBOL(ehca_flightrecorder); + +DECLARE_RWSEM(ehca_qp_idr_sem); +DECLARE_RWSEM(ehca_cq_idr_sem); +DEFINE_IDR(ehca_qp_idr); +DEFINE_IDR(ehca_cq_idr); + +struct ehca_module ehca_module; +struct workqueue_struct *ehca_wq; +struct task_struct *ehca_kthread_eq; + +/** + * ehca_init_trace - TODO + */ +void ehca_init_trace(void) +{ + EDEB_EN(7, ""); + + if (ehca_tracelevel != -1) { + int i; + for (i = 0; i < EHCA_EDEB_TRACE_MASK_SIZE; i++) + ehca_edeb_mask[i] = ehca_tracelevel; + } + + EDEB_EX(7, ""); +} + +/** + * ehca_init_flight - TODO + */ +void ehca_init_flight(void) +{ + EDEB_EN(7, ""); + + memset(ehca_flightrecorder, 0xFA, + sizeof(unsigned long) * EHCA_FLIGHTRECORDER_SIZE); + atomic_set(&ehca_flightrecorder_index, 0); + ehca_flightrecorder[0] = 0x12345678abcdef0; + + EDEB_EX(7, ""); +} + +/** + * ehca_flight_to_printk - TODO + */ +void ehca_flight_to_printk(void) +{ + int cur_offset = 
atomic_read(&ehca_flightrecorder_index); + int new_offset = cur_offset - (EHCA_FLIGHTRECORDER_BACKLOG * 4); + u32 flight_offset; + int i; + + if (new_offset < 0) + new_offset = EHCA_FLIGHTRECORDER_SIZE + new_offset - 4; + + printk(KERN_ERR + "EHCA ----- flight recorder begin " + "-------------------------------------------\n"); + + for (i = 0; i < EHCA_FLIGHTRECORDER_BACKLOG; i++) { + new_offset += 4; + flight_offset = (u32) new_offset % EHCA_FLIGHTRECORDER_SIZE; + + printk(KERN_ERR "EHCA %02d: %.16lX %.16lX %.16lX %.16lX\n", + i + 1, + ehca_flightrecorder[flight_offset], + ehca_flightrecorder[flight_offset + 1], + ehca_flightrecorder[flight_offset + 2], + ehca_flightrecorder[flight_offset + 3]); + } + + printk(KERN_ERR + "EHCA ----- flight recorder end " + "---------------------------------------------\n"); +} + +#define EHCA_CACHE_CREATE(name) \ + ehca_module->cache_##name = \ + kmem_cache_create("ehca_cache_"#name, \ + sizeof(struct ehca_##name), \ + 0, SLAB_HWCACHE_ALIGN, \ + NULL, NULL); \ + if (ehca_module->cache_##name == NULL) { \ + EDEB_ERR(4, "Cannot create "#name" SLAB cache."); \ + return -ENOMEM; \ + } \ + +/** + * ehca_caches_create: TODO + */ +int ehca_caches_create(struct ehca_module *ehca_module) +{ + EDEB_EN(7, ""); + + EHCA_CACHE_CREATE(pd); + EHCA_CACHE_CREATE(cq); + EHCA_CACHE_CREATE(qp); + EHCA_CACHE_CREATE(av); + EHCA_CACHE_CREATE(mw); + EHCA_CACHE_CREATE(mr); + + EDEB_EX(7, ""); + + return 0; +} + +#define EHCA_CACHE_DESTROY(name) \ + ret = kmem_cache_destroy(ehca_module->cache_##name); \ + if (ret != 0) { \ + EDEB_ERR(4, "Cannot destroy "#name" SLAB cache. 
ret=%x", ret); \ + return ret; \ + } \ + +/** + * ehca_caches_destroy - TODO + */ +int ehca_caches_destroy(struct ehca_module *ehca_module) +{ + int ret; + + EDEB_EN(7, ""); + + EHCA_CACHE_DESTROY(pd); + EHCA_CACHE_DESTROY(cq); + EHCA_CACHE_DESTROY(qp); + EHCA_CACHE_DESTROY(av); + EHCA_CACHE_DESTROY(mw); + EHCA_CACHE_DESTROY(mr); + + EDEB_EX(7, ""); + + return 0; +} + +#define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) +#define EHCA_REVID EHCA_BMASK_IBM(40,63) + +/** + * ehca_num_ports - TODO + */ +int ehca_sense_attributes(struct ehca_shca *shca) +{ + int ret = -EINVAL; + u64 rc = H_Success; + struct query_hca_rblock *rblock; + + EDEB_EN(7, "shca=%p", shca); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Cannot allocate rblock memory."); + ret = -ENOMEM; + goto num_ports0; + } + + memset(rblock, 0, PAGE_SIZE); + + rc = hipz_h_query_hca(shca->ipz_hca_handle, rblock); + if (rc != H_Success) { + EDEB_ERR(4, "Cannot query device properties.rc=%lx", rc); + ret = -EPERM; + goto num_ports1; + } + + if (ehca_nr_ports == 1) + shca->num_ports = 1; + else + shca->num_ports = (u8) rblock->num_ports; + + EDEB(6, " ... found %x ports", rblock->num_ports); + + if (ehca_hw_level == 0) { + u32 hcaaver; + u32 revid; + + hcaaver = EHCA_BMASK_GET(EHCA_HCAAVER, rblock->hw_ver); + revid = EHCA_BMASK_GET(EHCA_REVID, rblock->hw_ver); + + EDEB(6, " ... hardware version=%x:%x", + hcaaver, revid); + + if ((hcaaver == 1) && (revid == 0)) + shca->hw_level = 0; + else if ((hcaaver == 1) && (revid == 1)) + shca->hw_level = 1; + else if ((hcaaver == 1) && (revid == 2)) + shca->hw_level = 2; + } + EDEB(6, " ... 
hardware level=%x", shca->hw_level); + + ret = 0; + + num_ports1: + kfree(rblock); + + num_ports0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static int init_node_guid(struct ehca_shca* shca) +{ + int ret = 0; + struct query_hca_rblock *rblock; + + EDEB_EN(7, ""); + + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (rblock == NULL) { + EDEB_ERR(4, "Can't allocate rblock memory."); + ret = -ENOMEM; + goto init_node_guid0; + } + + memset(rblock, 0, PAGE_SIZE); + + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_Success) { + EDEB_ERR(4, "Can't query device properties"); + ret = -EINVAL; + goto init_node_guid1; + } + + memcpy(&shca->ib_device.node_guid, &rblock->node_guid, (sizeof(u64))); + + init_node_guid1: + kfree(rblock); + + init_node_guid0: + EDEB_EX(7, "node_guid=%lx ret=%x", shca->ib_device.node_guid, ret); + + return ret; +} + +int ehca_register_device(struct ehca_shca *shca) +{ + int ret = 0; + + EDEB_EN(7, "shca=%p", shca); + + ret = init_node_guid(shca); + if (ret != 0) + return ret; + + strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); + shca->ib_device.owner = THIS_MODULE; + + /* TODO: ABI ver later with define */ + shca->ib_device.uverbs_abi_ver = 1; + shca->ib_device.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST); + + shca->ib_device.node_type = RDMA_NODE_IB_CA; + shca->ib_device.phys_port_cnt = shca->num_ports; + 
shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; + shca->ib_device.query_device = ehca_query_device; + shca->ib_device.query_port = ehca_query_port; + shca->ib_device.query_gid = ehca_query_gid; + shca->ib_device.query_pkey = ehca_query_pkey; + /* shca->in_device.modify_device = ehca_modify_device */ + shca->ib_device.modify_port = ehca_modify_port; + shca->ib_device.alloc_ucontext = ehca_alloc_ucontext; + shca->ib_device.dealloc_ucontext = ehca_dealloc_ucontext; + shca->ib_device.alloc_pd = ehca_alloc_pd; + shca->ib_device.dealloc_pd = ehca_dealloc_pd; + shca->ib_device.create_ah = ehca_create_ah; + /* shca->ib_device.modify_ah = ehca_modify_ah; */ + shca->ib_device.query_ah = ehca_query_ah; + shca->ib_device.destroy_ah = ehca_destroy_ah; + shca->ib_device.create_qp = ehca_create_qp; + shca->ib_device.modify_qp = ehca_modify_qp; + shca->ib_device.query_qp = ehca_query_qp; + shca->ib_device.destroy_qp = ehca_destroy_qp; + shca->ib_device.post_send = ehca_post_send; + shca->ib_device.post_recv = ehca_post_recv; + shca->ib_device.create_cq = ehca_create_cq; + shca->ib_device.destroy_cq = ehca_destroy_cq; + + /* TODO: disabled due to func signature conflict */ + /* shca->ib_device.resize_cq = ehca_resize_cq; */ + + shca->ib_device.poll_cq = ehca_poll_cq; + /* shca->ib_device.peek_cq = ehca_peek_cq; */ + shca->ib_device.req_notify_cq = ehca_req_notify_cq; + /* shca->ib_device.req_ncomp_notif = ehca_req_ncomp_notif; */ + shca->ib_device.get_dma_mr = ehca_get_dma_mr; + shca->ib_device.reg_phys_mr = ehca_reg_phys_mr; + shca->ib_device.reg_user_mr = ehca_reg_user_mr; + shca->ib_device.query_mr = ehca_query_mr; + shca->ib_device.dereg_mr = ehca_dereg_mr; + shca->ib_device.rereg_phys_mr = ehca_rereg_phys_mr; + shca->ib_device.alloc_mw = ehca_alloc_mw; + shca->ib_device.bind_mw = ehca_bind_mw; + shca->ib_device.dealloc_mw = ehca_dealloc_mw; + shca->ib_device.alloc_fmr = ehca_alloc_fmr; + shca->ib_device.map_phys_fmr = ehca_map_phys_fmr; + shca->ib_device.unmap_fmr 
= ehca_unmap_fmr; + shca->ib_device.dealloc_fmr = ehca_dealloc_fmr; + shca->ib_device.attach_mcast = ehca_attach_mcast; + shca->ib_device.detach_mcast = ehca_detach_mcast; + /* shca->ib_device.process_mad = ehca_process_mad; */ + shca->ib_device.mmap = ehca_mmap; + + ret = ib_register_device(&shca->ib_device); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +/** + * ehca_create_aqp1 - TODO + * + * @shca: TODO + */ +static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) +{ + struct ehca_sport *sport; + struct ib_cq *ibcq; + struct ib_qp *ibqp; + struct ib_qp_init_attr qp_init_attr; + int ret = 0; + + EDEB_EN(7, "shca=%p port=%x", shca, port); + + sport = &shca->sport[port - 1]; + + if (sport->ibcq_aqp1 != NULL) { + EDEB_ERR(4, "AQP1 CQ is already created."); + return -EPERM; + } + + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10); + if (IS_ERR(ibcq)) { + EDEB_ERR(4, "Cannot create AQP1 CQ."); + return PTR_ERR(ibcq); + } + sport->ibcq_aqp1 = ibcq; + + if (sport->ibqp_aqp1 != NULL) { + EDEB_ERR(4, "AQP1 QP is already created."); + ret = -EPERM; + goto create_aqp1; + } + + memset(&qp_init_attr, 0, sizeof(struct ib_qp_init_attr)); + qp_init_attr.send_cq = ibcq; + qp_init_attr.recv_cq = ibcq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = 100; + qp_init_attr.cap.max_recv_wr = 100; + qp_init_attr.cap.max_send_sge = 2; + qp_init_attr.cap.max_recv_sge = 1; + qp_init_attr.qp_type = IB_QPT_GSI; + qp_init_attr.port_num = port; + qp_init_attr.qp_context = NULL; + qp_init_attr.event_handler = NULL; + qp_init_attr.srq = NULL; + + ibqp = ib_create_qp(&shca->pd->ib_pd, &qp_init_attr); + if (IS_ERR(ibqp)) { + EDEB_ERR(4, "Cannot create AQP1 QP."); + ret = PTR_ERR(ibqp); + goto create_aqp1; + } + sport->ibqp_aqp1 = ibqp; + + EDEB_EX(7, "ret=%x", ret); + + return ret; + + create_aqp1: + ib_destroy_cq(sport->ibcq_aqp1); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +/** + * ehca_destroy_aqp1 - TODO + */ +static int 
ehca_destroy_aqp1(struct ehca_sport *sport) +{ + int ret = 0; + + EDEB_EN(7, "sport=%p", sport); + + ret = ib_destroy_qp(sport->ibqp_aqp1); + if (ret != 0) { + EDEB_ERR(4, "Cannot destroy AQP1 QP. ret=%x", ret); + goto destroy_aqp1; + } + + ret = ib_destroy_cq(sport->ibcq_aqp1); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 CQ. ret=%x", ret); + + destroy_aqp1: + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static ssize_t ehca_show_debug_level(struct device_driver *ddp, char *buf) +{ + int f; + int total = 0; + total += snprintf(buf + total, PAGE_SIZE - total, "%d", + ehca_edeb_mask[0]); + for (f = 1; f < EHCA_EDEB_TRACE_MASK_SIZE; f++) { + total += snprintf(buf + total, PAGE_SIZE - total, ",%d", + ehca_edeb_mask[f]); + } + + total += snprintf(buf + total, PAGE_SIZE - total, "\n"); + + return total; +} + +static ssize_t ehca_store_debug_level(struct device_driver *ddp, + const char *buf, size_t count) +{ + int f; + for (f = 0; f < EHCA_EDEB_TRACE_MASK_SIZE; f++) { + char value = buf[f * 2] - '0'; + if ((value <= 9) && (count >= f * 2)) { + ehca_edeb_mask[f] = value; + } + } + return count; +} +DRIVER_ATTR(debug_level, S_IRUSR | S_IWUSR, + ehca_show_debug_level, ehca_store_debug_level); + +static ssize_t ehca_show_flightrecorder(struct device_driver *ddp, + char *buf) +{ + /* this is not style compliant, but snprintf is not fast enough */ + u64 *lbuf = (u64 *) buf; + lbuf[0] = (u64) & ehca_flightrecorder; + lbuf[1] = EHCA_FLIGHTRECORDER_SIZE; + lbuf[2] = atomic_read(&ehca_flightrecorder_index); + return sizeof(u64) * 3; +} +DRIVER_ATTR(flightrecorder, S_IRUSR, ehca_show_flightrecorder, 0); + +void ehca_create_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_create_file(&drv->driver, &driver_attr_debug_level); + driver_create_file(&drv->driver, &driver_attr_flightrecorder); +} + +void ehca_remove_driver_sysfs(struct ibmebus_driver *drv) +{ + driver_remove_file(&drv->driver, &driver_attr_debug_level); + driver_remove_file(&drv->driver, 
&driver_attr_flightrecorder); +} + +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,12) +#define EHCA_RESOURCE_ATTR_H(name) \ +static ssize_t ehca_show_##name(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) +#else +#define EHCA_RESOURCE_ATTR_H(name) \ +static ssize_t ehca_show_##name(struct device *dev, \ + char *buf) +#endif + +#define EHCA_RESOURCE_ATTR(name) \ +EHCA_RESOURCE_ATTR_H(name) \ +{ \ + struct ehca_shca *shca; \ + struct query_hca_rblock *rblock; \ + int len; \ + \ + shca = dev->driver_data; \ + \ + rblock = kmalloc(PAGE_SIZE, GFP_KERNEL); \ + if (rblock == NULL) { \ + EDEB_ERR(4, "Can't allocate rblock memory."); \ + return 0; \ + } \ + \ + memset(rblock, 0, PAGE_SIZE); \ + \ + if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_Success) { \ + EDEB_ERR(4, "Can't query device properties"); \ + kfree(rblock); \ + return 0; \ + } \ + \ + if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ + len = snprintf(buf, 256, "1"); \ + else \ + len = snprintf(buf, 256, "%d", rblock->name); \ + \ + if (len < 0) \ + return 0; \ + buf[len] = '\n'; \ + buf[len+1] = 0; \ + \ + kfree(rblock); \ + \ + return len+1; \ +} \ +static DEVICE_ATTR(name, S_IRUGO, ehca_show_##name, NULL); + +EHCA_RESOURCE_ATTR(num_ports); +EHCA_RESOURCE_ATTR(hw_ver); +EHCA_RESOURCE_ATTR(max_eq); +EHCA_RESOURCE_ATTR(cur_eq); +EHCA_RESOURCE_ATTR(max_cq); +EHCA_RESOURCE_ATTR(cur_cq); +EHCA_RESOURCE_ATTR(max_qp); +EHCA_RESOURCE_ATTR(cur_qp); +EHCA_RESOURCE_ATTR(max_mr); +EHCA_RESOURCE_ATTR(cur_mr); +EHCA_RESOURCE_ATTR(max_mw); +EHCA_RESOURCE_ATTR(cur_mw); +EHCA_RESOURCE_ATTR(max_pd); +EHCA_RESOURCE_ATTR(max_ah); + +static ssize_t ehca_show_adapter_handle(struct device *dev, +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,12) + struct device_attribute *attr, +#endif + char *buf) +{ + struct ehca_shca *shca = dev->driver_data; + + return sprintf(buf, "%lx\n", shca->ipz_hca_handle.handle); + +} +static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, 
NULL); + + + +void ehca_create_device_sysfs(struct ibmebus_dev *dev) +{ + device_create_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + device_create_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_create_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_create_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_create_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_create_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_create_file(&dev->ofdev.dev, &dev_attr_max_pd); + device_create_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +void ehca_remove_device_sysfs(struct ibmebus_dev *dev) +{ + device_remove_file(&dev->ofdev.dev, &dev_attr_adapter_handle); + device_remove_file(&dev->ofdev.dev, &dev_attr_num_ports); + device_remove_file(&dev->ofdev.dev, &dev_attr_hw_ver); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_eq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_cq); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_qp); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mr); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_cur_mw); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_pd); + device_remove_file(&dev->ofdev.dev, &dev_attr_max_ah); +} + +/** + * ehca_probe - TODO + */ +static int __devinit ehca_probe(struct ibmebus_dev *dev, + const struct of_device_id *id) +{ 
+ struct ehca_shca *shca; + u64 *handle; + struct ib_pd *ibpd; + int ret = 0; + + EDEB_EN(7, "name=%s", dev->name); + + handle = (u64 *)get_property(dev->ofdev.node, "ibm,hca-handle", NULL); + if (!handle) { + EDEB_ERR(4, "Cannot get eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + if (!(*handle)) { + EDEB_ERR(4, "Wrong eHCA handle for adapter: %s.", + dev->ofdev.node->full_name); + return -ENODEV; + } + + shca = (struct ehca_shca *)ib_alloc_device(sizeof(*shca)); + if (shca == NULL) { + EDEB_ERR(4, "Cannot allocate shca memory."); + return -ENOMEM; + } + + shca->ibmebus_dev = dev; + shca->ipz_hca_handle.handle = *handle; + dev->ofdev.dev.driver_data = shca; + + ret = ehca_sense_attributes(shca); + if (ret < 0) { + EDEB_ERR(4, "Cannot sense eHCA attributes."); + goto probe1; + } + + /* create event queues */ + ret = ehca_create_eq(shca, &shca->eq, EHCA_EQ, 2048); + if (ret != 0) { + EDEB_ERR(4, "Cannot create EQ."); + goto probe1; + } + + ret = ehca_create_eq(shca, &shca->neq, EHCA_NEQ, 513); + if (ret != 0) { + EDEB_ERR(4, "Cannot create NEQ."); + goto probe2; + } + + /* create internal protection domain */ + ibpd = ehca_alloc_pd(&shca->ib_device, (void*)(-1), 0); + if (IS_ERR(ibpd)) { + EDEB_ERR(4, "Cannot create internal PD."); + ret = PTR_ERR(ibpd); + goto probe3; + } + + shca->pd = container_of(ibpd, struct ehca_pd, ib_pd); + shca->pd->ib_pd.device = &shca->ib_device; + + /* create internal max MR */ + if (shca->maxmr == 0) { + struct ehca_mr *e_maxmr = 0; + ret = ehca_reg_internal_maxmr(shca, shca->pd, &e_maxmr); + if (ret != 0) { + EDEB_ERR(4, "Cannot create internal MR. 
ret=%x", ret); + goto probe4; + } + shca->maxmr = e_maxmr; + } + + ret = ehca_register_device(shca); + if (ret != 0) { + EDEB_ERR(4, "Cannot register Infiniband device."); + goto probe5; + } + + /* create AQP1 for port 1 */ + if (ehca_open_aqp1 == 1) { + shca->sport[0].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 1); + if (ret != 0) { + EDEB_ERR(4, "Cannot create AQP1 for port 1."); + goto probe6; + } + } + + /* create AQP1 for port 2 */ + if ((ehca_open_aqp1 == 1) && (shca->num_ports == 2)) { + shca->sport[1].port_state = IB_PORT_DOWN; + ret = ehca_create_aqp1(shca, 2); + if (ret != 0) { + EDEB_ERR(4, "Cannot create AQP1 for port 2."); + goto probe7; + } + } + + ehca_create_device_sysfs(dev); + + spin_lock(&ehca_module.shca_lock); + list_add(&shca->shca_list, &ehca_module.shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return 0; + + probe7: + ret = ehca_destroy_aqp1(&shca->sport[0]); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 for port 1. ret=%x", ret); + + probe6: + ib_unregister_device(&shca->ib_device); + + probe5: + ret = ehca_dereg_internal_maxmr(shca); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret); + + probe4: + ret = ehca_dealloc_pd(&shca->pd->ib_pd); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret); + + probe3: + ret = ehca_destroy_eq(shca, &shca->neq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy NEQ. ret=%x", ret); + + probe2: + ret = ehca_destroy_eq(shca, &shca->eq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy EQ. 
ret=%x", ret); + + probe1: + ib_dealloc_device(&shca->ib_device); + + EDEB_EX(4, "ret=%x", ret); + + return -EINVAL; +} + +static int __devexit ehca_remove(struct ibmebus_dev *dev) +{ + struct ehca_shca *shca = dev->ofdev.dev.driver_data; + int ret; + + EDEB_EN(7, "shca=%p", shca); + + ehca_remove_device_sysfs(dev); + + if (ehca_open_aqp1 == 1) { + int i; + + for (i = 0; i < shca->num_ports; i++) { + ret = ehca_destroy_aqp1(&shca->sport[i]); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy AQP1 for port %x." + " ret=%x", i, ret); + } + } + + ib_unregister_device(&shca->ib_device); + + ret = ehca_dereg_internal_maxmr(shca); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal MR. ret=%x", ret); + + ret = ehca_dealloc_pd(&shca->pd->ib_pd); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy internal PD. ret=%x", ret); + + ret = ehca_destroy_eq(shca, &shca->eq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy EQ. ret=%x", ret); + + ret = ehca_destroy_eq(shca, &shca->neq); + if (ret != 0) + EDEB_ERR(4, "Cannot destroy NEQ. ret=%x", ret); + + ib_dealloc_device(&shca->ib_device); + + spin_lock(&ehca_module.shca_lock); + list_del(&shca->shca_list); + spin_unlock(&ehca_module.shca_lock); + + EDEB_EX(7, "ret=%x", ret); + + return ret; +} + +static struct of_device_id ehca_device_table[] = +{ + { + .name = "lhca", + .compatible = "IBM,lhca", + }, + {}, +}; + +static struct ibmebus_driver ehca_driver = { + .name = "ehca", + .id_table = ehca_device_table, + .probe = ehca_probe, + .remove = ehca_remove, +}; + +/** + * ehca_module_init - eHCA initialization routine. 
+ */ +int __init ehca_module_init(void) +{ + int ret = 0; + + printk(KERN_INFO "eHCA Infiniband Device Driver " + "(Rel.: EHCA2_0047)\n"); + EDEB_EN(7, ""); + + idr_init(&ehca_qp_idr); + idr_init(&ehca_cq_idr); + + INIT_LIST_HEAD(&ehca_module.shca_list); + spin_lock_init(&ehca_module.shca_lock); + + ehca_init_trace(); + ehca_init_flight(); + + ehca_wq = create_workqueue("ehca"); + if (ehca_wq == NULL) { + EDEB_ERR(4, "Cannot create workqueue."); + ret = -ENOMEM; + goto module_init0; + } + + if ((ret = ehca_caches_create(&ehca_module)) != 0) { + ehca_catastrophic("Cannot create SLAB caches"); + ret = -ENOMEM; + goto module_init1; + } + + if ((ret = ibmebus_register_driver(&ehca_driver)) != 0) { + ehca_catastrophic("Cannot register eHCA device driver"); + ret = -EINVAL; + goto module_init2; + } + + ehca_create_driver_sysfs(&ehca_driver); + + if (ehca_poll_all_eqs != 1) { + EDEB_ERR(4, "WARNING!!!"); + EDEB_ERR(4, "It is possible to lose interrupts."); + + return 0; + } + + ehca_kthread_eq = kthread_create(ehca_poll_eqs, &ehca_module, + "ehca_poll_eqs"); + if (IS_ERR(ehca_kthread_eq)) { + EDEB_ERR(4, "Cannot create kthread_eq"); + ret = PTR_ERR(ehca_kthread_eq); + goto module_init3; + } + + wake_up_process(ehca_kthread_eq); + + EDEB_EX(7, "ret=%x", ret); + + return 0; + + module_init3: + ehca_remove_driver_sysfs(&ehca_driver); + ibmebus_unregister_driver(&ehca_driver); + + module_init2: + ehca_caches_destroy(&ehca_module); + + module_init1: + destroy_workqueue(ehca_wq); + + module_init0: + EDEB_EX(7, "ret=%x", ret); + + return ret; +}; + +/** + * ehca_module_exit - eHCA exit routine. 
+ */ +void __exit ehca_module_exit(void) +{ + EDEB_EN(7, ""); + + if (ehca_poll_all_eqs == 1) + kthread_stop(ehca_kthread_eq); + + ehca_remove_driver_sysfs(&ehca_driver); + ibmebus_unregister_driver(&ehca_driver); + + if (ehca_caches_destroy(&ehca_module) != 0) + ehca_catastrophic("Cannot destroy SLAB caches"); + + destroy_workqueue(ehca_wq); + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15) + idr_destroy_ext(&ehca_cq_idr); + idr_destroy_ext(&ehca_qp_idr); +#else + idr_destroy(&ehca_cq_idr); + idr_destroy(&ehca_qp_idr); +#endif + + EDEB_EX(7, ""); +}; + +module_init(ehca_module_init); +module_exit(ehca_module_exit); From rolandd at cisco.com Fri Feb 17 16:57:43 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:57:43 -0800 Subject: [openib-general] [PATCH 14/22] ehca completion queue handling In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005743.13620.29456.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_cq.c | 416 ++++++++++++++++++++++++++++++++++ 1 files changed, 416 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c new file mode 100644 index 0000000..ebee9c3 --- /dev/null +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -0,0 +1,416 @@ +/* + * IBM eServer eHCA Infiniband device driver for Linux on POWER + * + * Completion queue handling + * + * Authors: Waleri Fomin + * Reinhard Ernst + * Heiko J Schick + * Hoang-Nam Nguyen + * + * + * Copyright (c) 2005 IBM Corporation + * + * All rights reserved. + * + * This source code is distributed under a dual license of GPL v2.0 and OpenIB + * BSD. 
+ * + * OpenIB BSD License + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials + * provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ * + * $Id: ehca_cq.c,v 1.61 2006/02/06 10:17:34 schickhj Exp $ + */ + +#define DEB_PREFIX "e_cq" + +#include "ehca_kernel.h" +#include "ehca_common.h" +#include "ehca_iverbs.h" +#include "ehca_classes.h" +#include "ehca_irq.h" +#include "hcp_if.h" +#include +#include + +#define HIPZ_CQ_REGISTER_ORIG 0 + +int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) +{ + unsigned int qp_num = qp->ehca_qp_core.real_qp_num; + unsigned int key = qp_num%QP_HASHTAB_LEN; + unsigned long spl_flags = 0; + spin_lock_irqsave(&cq->spinlock, spl_flags); + list_add(&qp->list_entries, &cq->qp_hashtab[key]); + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + EDEB(7, "cq_num=%x real_qp_num=%x", cq->cq_number, qp_num); + return 0; +} + +int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) +{ + int ret = -EINVAL; + unsigned int key = real_qp_num%QP_HASHTAB_LEN; + struct list_head *iter = NULL; + struct ehca_qp *qp = NULL; + unsigned long spl_flags = 0; + spin_lock_irqsave(&cq->spinlock, spl_flags); + list_for_each(iter, &cq->qp_hashtab[key]) { + qp = list_entry(iter, struct ehca_qp, list_entries); + if (qp->ehca_qp_core.real_qp_num == real_qp_num) { + list_del(iter); + EDEB(7, "removed qp from cq .cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + ret = 0; + break; + } + } + spin_unlock_irqrestore(&cq->spinlock, spl_flags); + if (ret!=0) { + EDEB_ERR(4, "qp not found cq_num=%x real_qp_num=%x", + cq->cq_number, real_qp_num); + } + return ret; +} + +struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) +{ + struct ehca_qp *ret = NULL; + unsigned int key = real_qp_num%QP_HASHTAB_LEN; + struct list_head *iter = NULL; + struct ehca_qp *qp = NULL; + list_for_each(iter, &cq->qp_hashtab[key]) { + qp = list_entry(iter, struct ehca_qp, list_entries); + if (qp->ehca_qp_core.real_qp_num == real_qp_num) { + ret = qp; + break; + } + } + return ret; +} + +struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, + struct ib_ucontext *context, + 
struct ib_udata *udata) +{ + struct ib_cq *cq = NULL; + struct ehca_cq *my_cq = NULL; + u32 number_of_entries = cqe; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + struct ipz_eq_handle eq_handle; + struct ipz_cq_handle *cq_handle_ref = NULL; + u32 act_nr_of_entries = 0; + u32 act_pages = 0; + u32 counter = 0; + void *vpage = NULL; + u64 rpage = 0; + struct h_galpa gal; + u64 CQx_FEC = 0; + u64 hipz_rc = H_Success; + int ipz_rc = 0; + int ret = 0; + const u32 additional_cqe=20; + int i= 0; + + EHCA_CHECK_DEVICE_P(device); + EDEB_EN(7, "device=%p cqe=%x context=%p", + device, cqe, context); + /* cq's maximum depth is 4GB-64 + * but we need additional 20 as buffer for receiving errors cqes + */ + if (cqe>=0xFFFFFFFF-64-additional_cqe) { + return ERR_PTR(-EINVAL); + } + number_of_entries += additional_cqe; + + my_cq = ehca_cq_new(); + if (my_cq == NULL) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Out of memory for ehca_cq struct " + "device=%p", device); + goto create_cq_exit0; + } + cq = &my_cq->ib_cq; + + shca = container_of(device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + eq_handle = shca->eq.ipz_eq_handle; + cq_handle_ref = &my_cq->ipz_cq_handle; + + do { + if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't reserve idr resources. " + "device=%p", device); + goto create_cq_exit1; + } + + down_write(&ehca_cq_idr_sem); + ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token); + up_write(&ehca_cq_idr_sem); + + } while (ret == -EAGAIN); + + if (ret) { + cq = ERR_PTR(-ENOMEM); + EDEB_ERR(4, + "Can't allocate new idr entry. 
" + "device=%p", device); + goto create_cq_exit1; + } + + hipz_rc = hipz_h_alloc_resource_cq(adapter_handle, + &my_cq->pf, + eq_handle, + my_cq->token, + number_of_entries, + cq_handle_ref, + &act_nr_of_entries, + &act_pages, + &my_cq->ehca_cq_core.galpas); + if (hipz_rc != H_Success) { + EDEB_ERR(4, + "hipz_h_alloc_resource_cq() failed " + "hipz_rc=%lx device=%p", hipz_rc, device); + cq = ERR_PTR(ehca2ib_return_code(hipz_rc)); + goto create_cq_exit2; + } + + ipz_rc = + ipz_queue_ctor(&my_cq->ehca_cq_core.ipz_queue, act_pages, + EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0); + if (!ipz_rc) { + EDEB_ERR(4, + "ipz_queue_ctor() failed " + "ipz_rc=%x device=%p", ipz_rc, device); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit3; + } + + for (counter = 0; counter < act_pages; counter++) { + vpage = ipz_QPageit_get_inc(&my_cq->ehca_cq_core.ipz_queue); + if (!vpage) { + EDEB_ERR(4, "ipz_QPageit_get_inc() " + "returns NULL device=%p", device); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + rpage = ehca_kv_to_g(vpage); + + hipz_rc = hipz_h_register_rpage_cq(adapter_handle, + my_cq->ipz_cq_handle, + &my_cq->pf, + 0, + HIPZ_CQ_REGISTER_ORIG, + rpage, + 1, + my_cq->ehca_cq_core.galpas. 
+ kernel); + + if (hipz_rc < H_Success) { + EDEB_ERR(4, "hipz_h_register_rpage_cq() failed " + "ehca_cq=%p cq_num=%x hipz_rc=%lx " + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + hipz_rc, counter, act_pages); + cq = ERR_PTR(-EINVAL); + goto create_cq_exit4; + } + + if (counter == (act_pages - 1)) { + vpage = ipz_QPageit_get_inc( + &my_cq->ehca_cq_core.ipz_queue); + if ((hipz_rc != H_Success) || (vpage != 0)) { + EDEB_ERR(4, "Registration of pages not " + "complete ehca_cq=%p cq_num=%x " + "hipz_rc=%lx", + my_cq, my_cq->cq_number, hipz_rc); + cq = ERR_PTR(-EAGAIN); + goto create_cq_exit4; + } + } else { + if (hipz_rc != H_PAGE_REGISTERED) { + EDEB_ERR(4, "Registration of page failed " + "ehca_cq=%p cq_num=%x hipz_rc=%lx" + "counter=%i act_pages=%i", + my_cq, my_cq->cq_number, + hipz_rc, counter, act_pages); + cq = ERR_PTR(-ENOMEM); + goto create_cq_exit4; + } + } + } + + ipz_QEit_reset(&my_cq->ehca_cq_core.ipz_queue); + + gal = my_cq->ehca_cq_core.galpas.kernel; + CQx_FEC = hipz_galpa_load(gal, CQTEMM_OFFSET(CQx_FEC)); + EDEB(8, "ehca_cq=%p cq_num=%x CQx_FEC=%lx", + my_cq, my_cq->cq_number, CQx_FEC); + + my_cq->ib_cq.cqe = my_cq->nr_of_entries = + act_nr_of_entries-additional_cqe; + my_cq->cq_number = (my_cq->ipz_cq_handle.handle) & 0xffff; + + for (i=0; i<QP_HASHTAB_LEN; i++) { + INIT_LIST_HEAD(&my_cq->qp_hashtab[i]); + } + + if (context) { + struct ehca_create_cq_resp resp; + struct vm_area_struct * vma; + resp.cq_number = my_cq->cq_number; + resp.token = my_cq->token; + resp.ehca_cq_core = my_cq->ehca_cq_core; + + ehca_mmap_nopage(((u64) (my_cq->token) << 32) | 0x12000000, + my_cq->ehca_cq_core.ipz_queue.queue_length, + ((void**)&resp.ehca_cq_core.ipz_queue.queue), + &vma); + my_cq->uspace_queue = (u64)resp.ehca_cq_core.ipz_queue.queue; + ehca_mmap_register(my_cq->ehca_cq_core.galpas.user.fw_handle, + ((void**)&resp.ehca_cq_core.galpas.kernel.fw_handle), + &vma); + my_cq->uspace_fwh = (u64)resp.ehca_cq_core.galpas.kernel.fw_handle; + if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { + EDEB_ERR(4, 
"Copy to udata failed."); + goto create_cq_exit4; + } + } + + EDEB_EX(7,"retcode=%p ehca_cq=%p cq_num=%x cq_size=%x", + cq, my_cq, my_cq->cq_number, act_nr_of_entries); + return cq; + + create_cq_exit4: + ipz_queue_dtor(&my_cq->ehca_cq_core.ipz_queue); + + create_cq_exit3: + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + EDEB(3, "hipz_h_destroy_cq() failed ehca_cq=%p cq_num=%x hipz_rc=%lx", + my_cq, my_cq->cq_number, hipz_rc); + + create_cq_exit2: + /* dereg idr */ + down_write(&ehca_cq_idr_sem); + idr_remove(&ehca_cq_idr, my_cq->token); + up_write(&ehca_cq_idr_sem); + + create_cq_exit1: + /* free cq struct */ + ehca_cq_delete(my_cq); + + create_cq_exit0: + EDEB_EX(7, "An error has occured retcode=%p ", cq); + return cq; +} + +int ehca_destroy_cq(struct ib_cq *cq) +{ + u64 hipz_rc = H_Success; + int retcode = 0; + struct ehca_cq *my_cq = NULL; + int cq_num = 0; + struct ib_device *device = NULL; + struct ehca_shca *shca = NULL; + struct ipz_adapter_handle adapter_handle; + + EHCA_CHECK_CQ(cq); + my_cq = container_of(cq, struct ehca_cq, ib_cq); + cq_num = my_cq->cq_number; + device = cq->device; + EHCA_CHECK_DEVICE(device); + shca = container_of(device, struct ehca_shca, ib_device); + adapter_handle = shca->ipz_hca_handle; + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + + down_write(&ehca_cq_idr_sem); + idr_remove(&ehca_cq_idr, my_cq->token); + up_write(&ehca_cq_idr_sem); + + /* un-mmap if vma alloc */ + if (my_cq->uspace_queue!=0) { + struct ehca_cq_core *cq_core = &my_cq->ehca_cq_core; + retcode = ehca_munmap(my_cq->uspace_queue, + cq_core->ipz_queue.queue_length); + retcode = ehca_munmap(my_cq->uspace_fwh, 4096); + } + + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 0); + if (hipz_rc == H_R_STATE) { + /* cq in err: read err data and destroy it forcibly */ + EDEB(4, "ehca_cq=%p cq_num=%x ressource=%lx in err state. 
" + "Try to delete it forcibly.", + my_cq, my_cq->cq_number, my_cq->ipz_cq_handle.handle); + ehca_error_data(shca, my_cq->ipz_cq_handle.handle); + hipz_rc = hipz_h_destroy_cq(adapter_handle, my_cq, 1); + if (hipz_rc == H_Success) { + EDEB(4, "ehca_cq=%p cq_num=%x deleted successfully.", + my_cq, my_cq->cq_number); + } + } + if (hipz_rc != H_Success) { + EDEB_ERR(4,"hipz_h_destroy_cq() failed " + "hipz_rc=%lx ehca_cq=%p cq_num=%x", + hipz_rc, my_cq, my_cq->cq_number); + retcode = ehca2ib_return_code(hipz_rc); + goto destroy_cq_exit0;/*@TODO*/ + } + ipz_queue_dtor(&my_cq->ehca_cq_core.ipz_queue); + ehca_cq_delete(my_cq); + + destroy_cq_exit0: + EDEB_EX(7, "ehca_cq=%p cq_num=%x retcode=%x ", + my_cq, cq_num, retcode); + return retcode; +} + +int ehca_resize_cq(struct ib_cq *cq, int cqe) +{ + int retcode = 0; + struct ehca_cq *my_cq = NULL; + + if (unlikely(NULL == cq)) { + EDEB_ERR(4, "cq is NULL"); + return -EFAULT; + } + + my_cq = container_of(cq, struct ehca_cq, ib_cq); + EDEB_EN(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + /*TODO proper resize still needs to be done*/ + if (cqe > cq->cqe) { + retcode = -EINVAL; + } + EDEB_EX(7, "ehca_cq=%p cq_num=%x", + my_cq, my_cq->cq_number); + return retcode; +} + +/* eof ehca_cq.c */ From rolandd at cisco.com Fri Feb 17 16:58:02 2006 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 16:58:02 -0800 Subject: [openib-general] [PATCH 22/22] ehca Makefile/Kconfig changes In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: <20060218005801.13620.38625.stgit@localhost.localdomain> From: Roland Dreier --- drivers/infiniband/Kconfig | 2 ++ drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/ehca/Kbuild | 8 ++++++++ drivers/infiniband/hw/ehca/Kconfig | 6 ++++++ 4 files changed, 17 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index bdf0891..2b3ad03 100644 
--- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -31,6 +31,8 @@ config INFINIBAND_USER_ACCESS source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/hw/ehca/Kconfig" + source "drivers/infiniband/ulp/ipoib/Kconfig" source "drivers/infiniband/ulp/srp/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index a43fb34..eb7788f 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,4 +1,5 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/hw/ehca/Kbuild b/drivers/infiniband/hw/ehca/Kbuild new file mode 100644 index 0000000..7b610b1 --- /dev/null +++ b/drivers/infiniband/hw/ehca/Kbuild @@ -0,0 +1,8 @@ +obj-$(CONFIG_INFINIBAND_EHCA) += hcad_mod.o + +hcad_mod-objs = ehca_main.o ehca_hca.o ipz_pt_fn.o ehca_classes.o ehca_av.o \ + ehca_pd.o ehca_mrmw.o ehca_cq.o ehca_sqp.o ehca_qp.o hcp_sense.o \ + ehca_eq.o ehca_irq.o hcp_phyp.o ehca_mcast.o ehca_reqs.o \ + ehca_uverbs.o + +CFLAGS +=-DP_SERIES -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL diff --git a/drivers/infiniband/hw/ehca/Kconfig b/drivers/infiniband/hw/ehca/Kconfig new file mode 100644 index 0000000..b875649 --- /dev/null +++ b/drivers/infiniband/hw/ehca/Kconfig @@ -0,0 +1,6 @@ +config INFINIBAND_EHCA + tristate "eHCA support" + depends on IBMEBUS && INFINIBAND + ---help--- + This is a low level device driver for the IBM + GX based Host channel adapters (HCAs) \ No newline at end of file From greg at kroah.com Fri Feb 17 17:54:13 2006 From: greg at kroah.com (Greg KH) Date: Fri, 17 Feb 2006 17:54:13 -0800 Subject: [openib-general] Re: [PATCH 04/22] OF adapter probing In-Reply-To: <20060218005712.13620.82908.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> 
<20060218005712.13620.82908.stgit@localhost.localdomain> Message-ID: <20060218015413.GA17653@kroah.com> On Fri, Feb 17, 2006 at 04:57:14PM -0800, Roland Dreier wrote: > +int hipz_count_adapters(void) > +{ > + int num = 0; > + struct device_node *dn = NULL; > + > + EDEB_EN(7, ""); > + > + while ((dn = of_find_node_by_name(dn, "lhca"))) { > + num++; > + } The { } are not needed here. > + > + of_node_put(dn); > + > + if (num == 0) { > + EDEB_ERR(4, "No lhca node name was found in the" > + " Open Firmware device tree."); > + return -ENODEV; > + } > + > + EDEB(6, " ... found %x adapter(s)", num); > + > + EDEB_EX(7, "num=%x", num); > + > + return num; > +} > + > +int hipz_probe_adapters(char **adapter_list) > +{ > + int ret = 0; > + int num = 0; > + struct device_node *dn = NULL; > + char *loc; > + > + EDEB_EN(7, "adapter_list=%p", adapter_list); > + > + while ((dn = of_find_node_by_name(dn, "lhca"))) { > + loc = get_property(dn, "ibm,loc-code", NULL); > + if (loc == NULL) { > + EDEB_ERR(4, "No ibm,loc-code property for" > + " lhca Open Firmware device tree node."); > + ret = -ENODEV; > + goto probe_adapters0; > + } > + > + adapter_list[num] = loc; > + EDEB(6, " ... found adapter[%x] with loc-code: %s", num, loc); > + num++; > + } > + > + probe_adapters0: > + of_node_put(dn); Please use tabs everywhere. Hm, wait, that's a label. Put it where it belongs, over on the left please. thanks, greg k-h From greg at kroah.com Fri Feb 17 17:58:08 2006 From: greg at kroah.com (Greg KH) Date: Fri, 17 Feb 2006 17:58:08 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. 
In-Reply-To: <20060218005707.13620.20538.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> Message-ID: <20060218015808.GB17653@kroah.com> On Fri, Feb 17, 2006 at 04:57:07PM -0800, Roland Dreier wrote: > From: Roland Dreier > > This is a very large file with way too much code for a .h file. > The functions look too big to be inlined also. Is there any way > for this code to move to a .c file? Roland, your comments are fine, but what about the original author's descriptions of what each patch are? Come on, IBM allows developers to post code to lkml, just look at the archives for proof. For them to use a proxy like this is very strange, and also, there is no Signed-off-by: record from the original authors, which is not ok. And why aren't you using the standard firmware interface in the kernel? > +#ifndef CONFIG_PPC64 > +#ifndef Z_SERIES > +#warning "included with wrong target, this is a p file" > +#endif > +#endif It's a "p" file? What's that? Is this even needed? thanks, greg k-h From rdreier at cisco.com Fri Feb 17 18:04:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Feb 2006 18:04:56 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218015808.GB17653@kroah.com> (Greg KH's message of "Fri, 17 Feb 2006 17:58:08 -0800") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> Message-ID: Greg> Roland, your comments are fine, but what about the original Greg> author's descriptions of what each patch are? This is actually me breaking up a giant driver into pieces small enough to post to lkml without hitting the 100 KB limit. 
This is just an RFC -- I assume the driver is going to get merged in the end as one big git changeset with a changelog like "add driver for IBM eHCA InfiniBand adapters". Greg> Come on, IBM allows developers to post code to lkml, just Greg> look at the archives for proof. For them to use a proxy Greg> like this is very strange, and also, there is no Greg> Signed-off-by: record from the original authors, which is Greg> not ok. Well, the eHCA guys tell me that they can't post patches to lkml. You're right that the final merge will have to have an IBM Signed-off-by: line but as I said this is just an RFC. There are many reasons beyond patch format issues that make this stuff unmergeable as-is. Greg> And why aren't you using the standard firmware interface in Greg> the kernel? This is actually stuff to talk to the firmware that sits below the kernel on IBM ppc64 machines, not an interface to load device firmware from userspace. - R. 
From kschoche at scl.ameslab.gov Fri Feb 17 19:40:14 2006 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Fri, 17 Feb 2006 21:40:14 -0600 Subject: [openib-general] Re: IBG2 installation In-Reply-To: <20060217124312.GC19033@mellanox.co.il> References: <43F574B5.903@gs-lab.com> <20060217124312.GC19033@mellanox.co.il> Message-ID: <43F6971E.1000902@scl.ameslab.gov> Michael S. Tsirkin wrote: > Quoting r. Karun Beer Sharma : > >> Subject: IBG2 installation >> >> Hi: >> >> I have installed IBG2 (ver. 2.0.1 from Mellanox) with 2.6.9-22EL kernel >> version. >> The installation seems OK and I am able to execute some of the commands >> like ibnetdiscover etc. >> >> Then I downloaded Netpipe (ver 3.6.2) and tried to make (make ib) it on >> my machine. I am getting errors of missing header files (ib_defs.h). >> I checked the makefile and observed that VAPI_INC path is required. I >> searhed /usr/include but was not able to find the required header files. >> >> Please let me know if I need to install something else also. >> >> Thanks. >> Regards, >> Karun >> > > Looks like its trying to use gen1 headers (VAPI is gen1). > I dont know much about netpipe - does it support gen2? > > To answer your questions about the NetPipe stuff: The plan is to include the gen2 support in the upcoming Netpipe 3.6.3 release. We have a preliminary version of Netpipe that has gen2 support. It can be checked out from source.scl.ameslab.gov/hg/netpipe3-dev Please note that this is not the intended release and has not been updated yet for the libib*-rc5+ back-compatibility issues. I have a temporary patch for rc5+ support that I could supply if it is wanted immediately. 
It will be the same thing that is released in 3.6.3, it just hasnt made its way through to the server yet. Hope that helps, please CC future inquiries about Netpipe stuff to the netpipe mailing list as well ;) thanks, - Kyle From heiko.carstens at de.ibm.com Sat Feb 18 02:59:36 2006 From: heiko.carstens at de.ibm.com (Heiko Carstens) Date: Sat, 18 Feb 2006 11:59:36 +0100 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218015808.GB17653@kroah.com> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> Message-ID: <20060218105936.GD9216@osiris.boeblingen.de.ibm.com> > Come on, IBM allows developers to post code to lkml, just look at the > archives for proof. For them to use a proxy like this is very strange, Things aren't always that easy at IBM. You should know best :) Heiko From hch at infradead.org Sat Feb 18 04:17:53 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 12:17:53 +0000 Subject: [openib-general] Re: [PATCH 01/22] Add powerpc-specific clear_cacheline(), which just compiles to "dcbz". In-Reply-To: <20060218005704.13620.88286.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005704.13620.88286.stgit@localhost.localdomain> Message-ID: <20060218121753.GC911@infradead.org> On Fri, Feb 17, 2006 at 04:57:04PM -0800, Roland Dreier wrote: > From: Roland Dreier > > This is horribly non-portable. Yes. If this is needed it should go to an asm/ header, not in a driver. From hch at infradead.org Sat Feb 18 04:19:13 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 12:19:13 +0000 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. 
In-Reply-To: <20060218005707.13620.20538.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> Message-ID: <20060218121913.GD911@infradead.org> On Fri, Feb 17, 2006 at 04:57:07PM -0800, Roland Dreier wrote: > From: Roland Dreier > > This is a very large file with way too much code for a .h file. > The functions look too big to be inlined also. Is there any way > for this code to move to a .c file? > --- > > drivers/infiniband/hw/ehca/hcp_if.h | 2022 +++++++++++++++++++++++++++++++++++ > +#include "ehca_tools.h" > +#include "hipz_structs.h" > +#include "ehca_classes.h" > + > +#ifndef EHCA_USE_HCALL > +#include "hcz_queue.h" > +#include "hcz_mrmw.h" > +#include "hcz_emmio.h" > +#include "sim_prom.h" > +#endif > +#include "hipz_fns.h" > +#include "hcp_sense.h" > +#include "ehca_irq.h" > + > +#ifndef CONFIG_PPC64 > +#ifndef Z_SERIES > +#warning "included with wrong target, this is a p file" > +#endif > +#endif > + > +#ifdef EHCA_USE_HCALL > + > +#ifndef EHCA_USERDRIVER > +#include "hcp_phyp.h" > +#else > +#include "testbench/hcallbridge.h" > +#endif > +#endif the ifdefs should all go away and the build system should make sure it's only built for the right platforms. From hch at infradead.org Sat Feb 18 04:20:11 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 12:20:11 +0000 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> Message-ID: <20060218122011.GE911@infradead.org> On Fri, Feb 17, 2006 at 06:04:56PM -0800, Roland Dreier wrote: > Greg> Roland, your comments are fine, but what about the original > Greg> author's descriptions of what each patch are? 
> > This is actually me breaking up a giant driver into pieces small > enough to post to lkml without hitting the 100 KB limit. > > This is just an RFC -- I assume the driver is going to get merged in > the end as one big git changeset with a changelog like "add driver for > IBM eHCA InfiniBand adapters". > > Greg> Come on, IBM allows developers to post code to lkml, just > Greg> look at the archives for proof. For them to use a proxy > Greg> like this is very strange, and also, there is no > Greg> Signed-off-by: record from the original authors, which is > Greg> not ok. > > Well, the eHCA guys tell me that they can't post patches to lkml. Then they lie. And not posting to lkml is a good reason not to merge an otherwise perfect driver. (which this one is far from) From hch at infradead.org Sat Feb 18 04:23:17 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 12:23:17 +0000 Subject: [openib-general] Re: [PATCH 03/22] pHype specific stuff In-Reply-To: <20060218005709.13620.77409.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005709.13620.77409.stgit@localhost.localdomain> Message-ID: <20060218122317.GF911@infradead.org> > +u64 hipz_galpa_load(struct h_galpa galpa, u32 offset) > +{ > + u64 addr = galpa.fw_handle + offset; > + u64 out; > + EDEB_EN(7, "addr=%lx offset=%x ", addr, offset); > + out = *(u64 *) addr; why does this cast an u64 to a pointer? > +#ifndef EHCA_USERDRIVER > +inline static int hcall_map_page(u64 physaddr, u64 * mapaddr) > +{ > + *mapaddr = (u64)(ioremap(physaddr, 4096)); > + > + EDEB(7, "ioremap physaddr=%lx mapaddr=%lx", physaddr, *mapaddr); > + return 0; ioremap returns void __iomem * and casting that to any integer type is wrong. > +inline static int hcall_unmap_page(u64 mapaddr) > +{ > + EDEB(7, "mapaddr=%lx", mapaddr); > + iounmap((void *)(mapaddr)); > + return 0; dito for iounmap and casting back. guys, please run this driver through sparse, thanks. 
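A hedged userspace sketch of the typing point raised above: the mapped address is kept as a pointer carrying an `__iomem`-style annotation instead of being stored in a `u64` and cast back. Everything here is an assumption for illustration: `__iomem` is defined away (in the kernel it is a sparse annotation), `galpa_sketch` and `galpa_load64` are hypothetical names, and real kernel code would go through the I/O accessors such as readq() rather than `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In the kernel this is a sparse address-space annotation; here it expands to nothing. */
#define __iomem

/* Hypothetical replacement for a handle that stores the mapping as a u64. */
struct galpa_sketch {
	const uint8_t __iomem *base;	/* typed mapping, never an integer */
	size_t len;			/* size of the mapped window in bytes */
};

/* Load a 64-bit value at 'offset' without any integer-to-pointer cast. */
static uint64_t galpa_load64(const struct galpa_sketch *g, size_t offset)
{
	uint64_t val;

	/* Refuse reads that would run past the mapped window. */
	assert(offset <= g->len && g->len - offset >= sizeof(val));
	memcpy(&val, g->base + offset, sizeof(val));
	return val;
}
```

Because the handle never passes through an integer type, a checker like sparse can follow the annotation from the mapping call to every access.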
> + /* if phype returns LongBusyXXX, > + * we retry several times, but not forever */ > + for (i = 0; i < 5; i++) { > + __asm__ __volatile__("mr 3,%10\n" > + "mr 4,%11\n" > + "mr 5,%12\n" assembly code under drivers/ is not acceptable. please create and for it or something similar. From mulix at mulix.org Sat Feb 18 04:26:31 2006 From: mulix at mulix.org (Muli Ben-Yehuda) Date: Sat, 18 Feb 2006 14:26:31 +0200 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218122011.GE911@infradead.org> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> Message-ID: <20060218122631.GA30535@granada.merseine.nu> On Sat, Feb 18, 2006 at 12:20:11PM +0000, Christoph Hellwig wrote: > > Well, the eHCA guys tell me that they can't post patches to lkml. > > Then they lie. And not posting to lkml is a good reason not to merge > an otherwise perfect driver. (which this one is far from) I don't speak for IBM or the authors, but there are perfectly reasonable reasons to ask someone else to post a patch on your behalf - including but not limited to to only being able to use Lotus Notes with one's IBM email. I'm sure you've all seen the travesties that Notes inflicts on inline patches. Cheers, Muli -- Muli Ben-Yehuda http://www.mulix.org | http://mulix.livejournal.com/ From hch at infradead.org Sat Feb 18 04:29:10 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 12:29:10 +0000 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. 
In-Reply-To: <20060218122631.GA30535@granada.merseine.nu> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> Message-ID: <20060218122910.GA1521@infradead.org> On Sat, Feb 18, 2006 at 02:26:31PM +0200, Muli Ben-Yehuda wrote: > I don't speak for IBM or the authors, but there are perfectly > reasonable reasons to ask someone else to post a patch on your behalf > - including but not limited to to only being able to use Lotus Notes > with one's IBM email. I'm sure you've all seen the travesties that > Notes inflicts on inline patches. Sure. And there are free webmail accounts that take about 10 minutes to set up, as well as various people offering shell access to Linux machines if you ask nicely. So this really is not an issue. I think this is more about IBM politics (especially in Boeblingen) sometimes making it pretty hard to post things. But that doesn't mean it's impossible, it just means they didn't try hard enough. From arjan at infradead.org Sat Feb 18 04:32:35 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 18 Feb 2006 13:32:35 +0100 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218122631.GA30535@granada.merseine.nu> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> Message-ID: <1140265955.4035.19.camel@laptopd505.fenrus.org> On Sat, 2006-02-18 at 14:26 +0200, Muli Ben-Yehuda wrote: > On Sat, Feb 18, 2006 at 12:20:11PM +0000, Christoph Hellwig wrote: > > > > Well, the eHCA guys tell me that they can't post patches to lkml. > > > > Then they lie. 
And not posting to lkml is a good reason not to merge > > an otherwise perfect driver. (which this one is far from) > > I don't speak for IBM or the authors, but there are perfectly > reasonable reasons to ask someone else to post a patch on your behalf > - including but not limited to to only being able to use Lotus Notes > with one's IBM email. I'm sure you've all seen the travesties that > Notes inflicts on inline patches. there are ways around that with webmail etc. The bigger issue is: if people can't be bothered to do those steps, why would they be bothered to do this for maintenance and bugfixes etc etc? Basically it's now already a de-facto unmaintained driver.... From info at schihei.de Sat Feb 18 04:46:10 2006 From: info at schihei.de (Heiko J Schick) Date: Sat, 18 Feb 2006 13:46:10 +0100 Subject: [openib-general] [PATCH 04/22] OF adapter probing In-Reply-To: <20060218005712.13620.82908.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005712.13620.82908.stgit@localhost.localdomain> Message-ID: Hello Roland, sorry, this file is not used anymore. The functions int hipz_count_adapters(void); int hipz_probe_adapters(char **adapter_list); u64 hipz_get_adapter_handle(char *adapter); nowadays handled by the IBMEBUS [1] bus device driver. [1]: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/ linux-2.6.git;a=commit;h=d7a301033f1990188f65abf4fe8e5b90ef0e3888 Regards, Heiko On Feb 18, 2006, at 1:57 AM, Roland Dreier wrote: > From: Roland Dreier > > hipz_probe_adapters() looks a little funny -- it seems to bail out > of all the remaining adapters if one of them isn't quite right. 
> --- > > drivers/infiniband/hw/ehca/hcp_sense.c | 144 +++++++++++++++++++++ > +++++++++++ > drivers/infiniband/hw/ehca/hcp_sense.h | 136 +++++++++++++++++++++ > +++++++++ > 2 files changed, 280 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/hw/ehca/hcp_sense.c b/drivers/ > infiniband/hw/ehca/hcp_sense.c > new file mode 100644 > index 0000000..83fa4a3 > --- /dev/null > +++ b/drivers/infiniband/hw/ehca/hcp_sense.c > @@ -0,0 +1,144 @@ > +/* > + * IBM eServer eHCA Infiniband device driver for Linux on POWER > + * > + * ehca detection and query code for POWER > + * > + * Authors: Heiko J Schick > + * > + * Copyright (c) 2005 IBM Corporation > + * > + * All rights reserved. > + * > + * This source code is distributed under a dual license of GPL > v2.0 and OpenIB > + * BSD. > + * > + * OpenIB BSD License > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following > conditions are met: > + * > + * Redistributions of source code must retain the above copyright > notice, this > + * list of conditions and the following disclaimer. > + * > + * Redistributions in binary form must reproduce the above > copyright notice, > + * this list of conditions and the following disclaimer in the > documentation > + * and/or other materials > + * provided with the distribution. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND > CONTRIBUTORS "AS IS" > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT > LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A > PARTICULAR PURPOSE > + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR > CONTRIBUTORS BE > + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, > EXEMPLARY, OR > + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, > PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR > + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF > LIABILITY, WHETHER > + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR > OTHERWISE) > + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF > ADVISED OF THE > + * POSSIBILITY OF SUCH DAMAGE. > + * > + * $Id: hcp_sense.c,v 1.10 2006/02/06 10:17:34 schickhj Exp $ > + */ > + > +#define DEB_PREFIX "snse" > + > +#include "ehca_kernel.h" > +#include "ehca_tools.h" > + > +int hipz_count_adapters(void) > +{ > + int num = 0; > + struct device_node *dn = NULL; > + > + EDEB_EN(7, ""); > + > + while ((dn = of_find_node_by_name(dn, "lhca"))) { > + num++; > + } > + > + of_node_put(dn); > + > + if (num == 0) { > + EDEB_ERR(4, "No lhca node name was found in the" > + " Open Firmware device tree."); > + return -ENODEV; > + } > + > + EDEB(6, " ... found %x adapter(s)", num); > + > + EDEB_EX(7, "num=%x", num); > + > + return num; > +} > + > +int hipz_probe_adapters(char **adapter_list) > +{ > + int ret = 0; > + int num = 0; > + struct device_node *dn = NULL; > + char *loc; > + > + EDEB_EN(7, "adapter_list=%p", adapter_list); > + > + while ((dn = of_find_node_by_name(dn, "lhca"))) { > + loc = get_property(dn, "ibm,loc-code", NULL); > + if (loc == NULL) { > + EDEB_ERR(4, "No ibm,loc-code property for" > + " lhca Open Firmware device tree node."); > + ret = -ENODEV; > + goto probe_adapters0; > + } > + > + adapter_list[num] = loc; > + EDEB(6, " ... 
found adapter[%x] with loc-code: %s", num, loc); > + num++; > + } > + > + probe_adapters0: > + of_node_put(dn); > + > + EDEB_EX(7, "ret=%x", ret); > + > + return ret; > +} > + > +u64 hipz_get_adapter_handle(char *adapter) > +{ > + struct device_node *dn = NULL; > + char *loc; > + u64 *u64data = NULL; > + u64 ret = 0; > + > + EDEB_EN(7, "adapter=%p", adapter); > + > + while ((dn = of_find_node_by_name(dn, "lhca"))) { > + loc = get_property(dn, "ibm,loc-code", NULL); > + if (loc == NULL) { > + EDEB_ERR(4, "No ibm,loc-code property for" > + " lhca Open Firmware device tree node."); > + goto get_adapter_handle0; > + } > + > + if (strcmp(loc, adapter) == 0) { > + u64data = > + (u64 *) get_property(dn, "ibm,hca-handle", NULL); > + break; > + } > + } > + > + if (u64data == NULL) { > + EDEB_ERR(4, "No ibm,hca-handle property for" > + " lhca Open Firmware device tree node with" > + " ibm,loc-code: %s.", adapter); > + goto get_adapter_handle0; > + } > + > + ret = *u64data; > + > + get_adapter_handle0: > + of_node_put(dn); > + > + EDEB_EX(7, "ret=%lx",ret); > + > + return ret; > +} > diff --git a/drivers/infiniband/hw/ehca/hcp_sense.h b/drivers/ > infiniband/hw/ehca/hcp_sense.h > new file mode 100644 > index 0000000..a49040b > --- /dev/null > +++ b/drivers/infiniband/hw/ehca/hcp_sense.h > @@ -0,0 +1,136 @@ > +/* > + * IBM eServer eHCA Infiniband device driver for Linux on POWER > + * > + * ehca detection and query code for POWER > + * > + * Authors: Heiko J Schick > + * > + * Copyright (c) 2005 IBM Corporation > + * > + * All rights reserved. > + * > + * This source code is distributed under a dual license of GPL > v2.0 and OpenIB > + * BSD. > + * > + * OpenIB BSD License > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following > conditions are met: > + * > + * Redistributions of source code must retain the above copyright > notice, this > + * list of conditions and the following disclaimer. 
> + * > + * Redistributions in binary form must reproduce the above > copyright notice, > + * this list of conditions and the following disclaimer in the > documentation > + * and/or other materials > + * provided with the distribution. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND > CONTRIBUTORS "AS IS" > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT > LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A > PARTICULAR PURPOSE > + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR > CONTRIBUTORS BE > + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, > EXEMPLARY, OR > + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, > PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR > + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF > LIABILITY, WHETHER > + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR > OTHERWISE) > + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF > ADVISED OF THE > + * POSSIBILITY OF SUCH DAMAGE. 
> + * > + * $Id: hcp_sense.h,v 1.11 2006/02/06 10:17:34 schickhj Exp $ > + */ > + > +#ifndef HCP_SENSE_H > +#define HCP_SENSE_H > + > +int hipz_count_adapters(void); > +int hipz_probe_adapters(char **adapter_list); > +u64 hipz_get_adapter_handle(char *adapter); > + > +/* query hca response block */ > +struct query_hca_rblock { > + u32 cur_reliable_dg; > + u32 cur_qp; > + u32 cur_cq; > + u32 cur_eq; > + u32 cur_mr; > + u32 cur_mw; > + u32 cur_ee_context; > + u32 cur_mcast_grp; > + u32 cur_qp_attached_mcast_grp; > + u32 reserved1; > + u32 cur_ipv6_qp; > + u32 cur_eth_qp; > + u32 cur_hp_mr; > + u32 reserved2[3]; > + u32 max_rd_domain; > + u32 max_qp; > + u32 max_cq; > + u32 max_eq; > + u32 max_mr; > + u32 max_hp_mr; > + u32 max_mw; > + u32 max_mrwpte; > + u32 max_special_mrwpte; > + u32 max_rd_ee_context; > + u32 max_mcast_grp; > + u32 max_qps_attached_all_mcast_grp; > + u32 max_qps_attached_mcast_grp; > + u32 max_raw_ipv6_qp; > + u32 max_raw_ethy_qp; > + u32 internal_clock_frequency; > + u32 max_pd; > + u32 max_ah; > + u32 max_cqe; > + u32 max_wqes_wq; > + u32 max_partitions; > + u32 max_rr_ee_context; > + u32 max_rr_qp; > + u32 max_rr_hca; > + u32 max_act_wqs_ee_context; > + u32 max_act_wqs_qp; > + u32 max_sge; > + u32 max_sge_rd; > + u32 memory_page_size_supported; > + u64 max_mr_size; > + u32 local_ca_ack_delay; > + u32 num_ports; > + u32 vendor_id; > + u32 vendor_part_id; > + u32 hw_ver; > + u64 node_guid; > + u64 hca_cap_indicators; > + u32 data_counter_register_size; > + u32 max_shared_rq; > + u32 max_isns_eq; > + u32 max_neq; > +} __attribute__ ((packed)); > + > +/* query port response block */ > +struct query_port_rblock { > + u32 state; > + u32 bad_pkey_cntr; > + u32 lmc; > + u32 lid; > + u32 subnet_timeout; > + u32 qkey_viol_cntr; > + u32 sm_sl; > + u32 sm_lid; > + u32 capability_mask; > + u32 init_type_reply; > + u32 pkey_tbl_len; > + u32 gid_tbl_len; > + u64 gid_prefix; > + u32 port_nr; > + u16 pkey_entries[16]; > + u8 reserved1[32]; > + u32 trent_size; > 
+ u32 trbuf_size; > + u64 max_msg_sz; > + u32 max_mtu; > + u32 vl_cap; > + u8 reserved2[1900]; > + u64 guid_entries[255]; > +} __attribute__ ((packed)); > + > +#endif 
From hch at lst.de Sat Feb 18 07:09:20 2006 From: hch at lst.de (Christoph Hellwig) Date: Sat, 18 Feb 2006 16:09:20 +0100 Subject: [openib-general] [PATCH 08/22] Generic ehca headers In-Reply-To: <20060218005723.13620.10389.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005723.13620.10389.stgit@localhost.localdomain> Message-ID: <20060218150920.GA23817@lst.de> On Fri, Feb 17, 2006 at 04:57:23PM -0800, Roland Dreier wrote: > From: Roland Dreier > > The defines of TRUE and FALSE look rather useless. Why are they needed? > > What is struct ehca_cache for? It doesn't seem to be used anywhere. > > ehca_kv_to_g() looks completely horrible. The whole idea of using > vmalloc()ed kernel memory to do DMA seems unacceptable to me. When you want to do scatter-gather DMA on kernel-virtual contiguous areas, allocate the pages individually and map them into kva using vmap(). Then DMA can be performed using dma_map_page, or in case you have lots of pages dma_map_sg after creating an S/G list. From rdreier at cisco.com Sat Feb 18 08:02:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 18 Feb 2006 08:02:33 -0800 Subject: [openib-general] [PATCH 04/22] OF adapter probing In-Reply-To: (Heiko J. Schick's message of "Sat, 18 Feb 2006 13:46:10 +0100") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005712.13620.82908.stgit@localhost.localdomain> Message-ID: Heiko> Hello Roland, sorry, this file is not used anymore. The Heiko> functions OK, please delete it from the svn tree. 
Thanks, Roland From rdreier at cisco.com Sat Feb 18 08:32:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 18 Feb 2006 08:32:28 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <1140265955.4035.19.camel@laptopd505.fenrus.org> (Arjan van de Ven's message of "Sat, 18 Feb 2006 13:32:35 +0100") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> Message-ID: Arjan> The bigger issue is: if people can't be bothered to do Arjan> those steps, why would they be bothered to do this for Arjan> maintenance and bugfixes etc etc? Basically it's now Arjan> already a de-facto unmaintained driver.... I don't think that's really a fair statement. The IBM people have been active and responsive in maintaining their driving in the openib.org svn tree. However, they asked me to post their driver for review because it would be difficult for them to do it. IBM people: can you clarify the restrictions you have? Why do you feel you can't post your own driver for review? Will you be able to post smaller patches to lkml in the future if the driver is merged? Thanks, Roland From sean.hefty at intel.com Sat Feb 18 09:05:16 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 18 Feb 2006 09:05:16 -0800 Subject: [openib-general] RE: [PATCH 1 of 3] mad: large RMPP support, Round 2 In-Reply-To: <20060212152744.GA19049@mellanox.co.il> Message-ID: Thanks for updating this. I know that Hal already committed these changes, but I have a couple of additional comments that I want to capture. >+struct ib_mad_multipacket_seg { >+ struct list_head list; >+ u32 num; >+ u16 size; >+ u8 data[0]; >+}; Does size change between segments? 
It doesn't seem like it should need to for segments that belong to the same MAD. We can remove size from here and add a seg_size to ib_mad_send_buf. > /** >+ * *ib_mad_get_multipacket_seg - returns a given RMPP segment. >+ * @send_buf: Previously allocated send data buffer. >+ * @seg_num: number of segment to return >+ * >+ * This routine returns a pointer to a segment of a multipacket RMPP message. >+ */ >+struct ib_mad_multipacket_seg >+*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num); I think we're serialized everywhere this call is made, but this call is not thread safe. We may want to make a note of this in the comment. - Sean From arjan at infradead.org Sat Feb 18 09:02:42 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 18 Feb 2006 18:02:42 +0100 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> Message-ID: <1140282163.6514.7.camel@laptopd505.fenrus.org> On Sat, 2006-02-18 at 08:32 -0800, Roland Dreier wrote: > Arjan> The bigger issue is: if people can't be bothered to do > Arjan> those steps, why would they be bothered to do this for > Arjan> maintenance and bugfixes etc etc? Basically it's now > Arjan> already a de-facto unmaintained driver.... > > I don't think that's really a fair statement. It's a concern at least; if they're just having trouble posting really big files that's one thing.. if they're not allowed to post at all that's another. > IBM people: can you clarify the restrictions you have? Why do you > feel you can't post your own driver for review? Will you be able to > post smaller patches to lkml in the future if the driver is merged? 
And can you respond to questions and user questions on lkml? From sean.hefty at intel.com Sat Feb 18 09:40:09 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 18 Feb 2006 09:40:09 -0800 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 In-Reply-To: <20060212153036.GC19049@mellanox.co.il> Message-ID: >+static inline void adjust_last_ack(struct ib_mad_send_wr_private *wr) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ >+ if (wr->last_ack < 2) >+ return; >+ else if (!wr->last_ack_seg) >+ list_for_each_entry(seg, &wr->multipacket_list, list) { >+ if (wr->last_ack == seg->num) { >+ wr->last_ack_seg = seg; >+ break; >+ } >+ } >+ else >+ list_for_each_entry(seg, &wr->last_ack_seg->list, list) { >+ if (wr->last_ack == seg->num) { >+ wr->last_ack_seg = seg; >+ break; >+ } >+ } >+} If we initialize last_ack_seg to the start of the list, can we combine the else if and else checks together? >@@ -647,6 +672,7 @@ static void process_rmpp_ack(struct ib_m > > if (seg_num > mad_send_wr->last_ack) { > mad_send_wr->last_ack = seg_num; >+ adjust_last_ack(mad_send_wr); If last_ack_seg references a segment that contains the seg_num, can we eliminate last_ack? 
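A minimal sketch of the first suggestion above (start last_ack_seg at the head of the list so that one loop covers both the "no cached segment yet" and "resume from cached segment" cases). The struct seg and struct send_wr types are hypothetical user-space stand-ins, not the kernel's ib_mad_multipacket_seg or ib_mad_send_wr_private, and a plain next pointer replaces list_head:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a segment: singly linked, ordered by
 * ascending segment number (data segments start at 2). */
struct seg {
    struct seg *next;
    unsigned int num;
};

/* Hypothetical stand-in for the send work request. */
struct send_wr {
    struct seg *last_ack_seg; /* cached position in the segment list */
    unsigned int last_ack;    /* highest segment number ACKed so far */
};

/* Point the cache at the first segment up front, so it is never NULL. */
void send_wr_init(struct send_wr *wr, struct seg *first)
{
    assert(first != NULL);
    wr->last_ack_seg = first;
    wr->last_ack = 0;
}

/* Single walk starting at the cached segment. Because ACKs only move
 * forward, the target is never behind the cache. If no segment matches,
 * the cache is left unchanged. */
void adjust_last_ack(struct send_wr *wr)
{
    struct seg *s;

    if (wr->last_ack < 2)
        return;

    for (s = wr->last_ack_seg; s != NULL; s = s->next) {
        if (s->num == wr->last_ack) {
            wr->last_ack_seg = s;
            break;
        }
    }
}
```

The sketch starts the walk at the cached segment itself rather than at the entry after it, which is an assumption made to keep the example short.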
>+static inline int alloc_send_rmpp_segs(struct ib_mad_send_wr_private *send_wr, >+ int message_size, int hdr_len, >+ int data_len, u8 rmpp_version, >+ gfp_t gfp_mask) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ struct ib_rmpp_mad *rmpp_mad = send_wr->send_buf.mad; >+ int seg_size, i = 2; >+ >+ rmpp_mad->rmpp_hdr.paylen_newwin = >+ cpu_to_be32(hdr_len - IB_MGMT_RMPP_HDR + data_len); >+ rmpp_mad->rmpp_hdr.rmpp_version = rmpp_version; >+ rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; >+ ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); >+ send_wr->total_length = message_size; >+ /* allocate RMPP buffers */ >+ message_size -= sizeof(struct ib_mad); >+ seg_size = sizeof(struct ib_mad) - hdr_len; >+ while (message_size > 0) { >+ seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + seg_size, >+ gfp_mask); It would be convenient if the MAD were cleared for the user, so we don't end up transferring random data, especially at the end of the user data. >+ if (!seg) { >+ printk(KERN_ERR "ib_create_send_mad: RMPP mem " >+ "alloc failed for len %zd, gfp %#x\n", >+ sizeof(struct ib_mad_multipacket_seg) + seg_size, >+ gfp_mask); >+ free_send_multipacket_list(send_wr); >+ return -ENOMEM; >+ } >+ seg->size = seg_size; Okay, seg_size is the same for all segments belonging to a single MAD. We can move this to ib_mad_send_buf. >+ mad_send_wr->last_ack_seg = NULL; >+ mad_send_wr->seg_num_seg = NULL; Mad_send_wr is cleared on allocation. Are there better values to initialize these variables to? The first/second segment? >+struct ib_mad_multipacket_seg >+*ib_rmpp_get_multipacket_seg(struct ib_mad_send_wr_private *wr, int seg_num) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ >+ if (seg_num == 2) { >+ wr->seg_num_seg = >+ container_of(wr->multipacket_list.next, >+ struct ib_mad_multipacket_seg, list); >+ return wr->seg_num_seg; >+ } >+ >+ /* get first list entry if was not already done */ >+ if (!wr->seg_num_seg) See previous comment.
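A rough user-space sketch of two of the comments above on alloc_send_rmpp_segs: hand out zeroed segments, and keep a single seg_size on the buffer instead of one per segment. Everything here (struct seg, struct send_buf, alloc_segs, free_segs) is a hypothetical stand-in for illustration; calloc() plays the role that a zeroing allocator such as kzalloc() would play in kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical segment: no per-segment size field. */
struct seg {
    struct seg *next;
    unsigned int num;
    unsigned char data[]; /* seg_size bytes, zero-filled on allocation */
};

/* Hypothetical send buffer: one seg_size for every segment of the MAD. */
struct send_buf {
    size_t seg_size;
    struct seg *head;
    struct seg *tail;
};

/* Release every segment and reset the list. */
void free_segs(struct send_buf *buf)
{
    struct seg *s = buf->head;

    while (s != NULL) {
        struct seg *next = s->next;
        free(s);
        s = next;
    }
    buf->head = buf->tail = NULL;
}

/* Allocate enough zeroed segments to hold payload_len bytes.
 * Returns 0 on success, -1 on allocation failure (nothing is leaked). */
int alloc_segs(struct send_buf *buf, size_t payload_len, size_t seg_size)
{
    unsigned int num = 2; /* data segments are numbered from 2 */
    size_t left = payload_len;

    assert(seg_size > 0);
    buf->seg_size = seg_size;
    buf->head = buf->tail = NULL;

    while (left > 0) {
        /* calloc clears the whole segment, so the padding after the
         * user data in the last segment is zero, not stale memory. */
        struct seg *s = calloc(1, sizeof(*s) + seg_size);

        if (s == NULL) {
            free_segs(buf);
            return -1;
        }
        s->num = num++;
        if (buf->tail != NULL)
            buf->tail->next = s;
        else
            buf->head = s;
        buf->tail = s;
        left -= (left < seg_size) ? left : seg_size;
    }
    return 0;
}
```

With a 500-byte payload and 220-byte segments this yields three segments numbered 2, 3 and 4, the last one padded with zeros.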
>+struct ib_mad_multipacket_seg >+*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num) >+{ >+ struct ib_mad_send_wr_private *wr; >+ >+ if (seg_num < 2) >+ return NULL; >+ >+ wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); >+ return ib_rmpp_get_multipacket_seg(wr, seg_num); >+} Treating the first segment special seems somewhat confusing. (Maybe this is a result of how the MAD/RMPP header is copied down from userspace?) >+ if (!rmpp_active && length > sizeof(struct ib_mad)) { >+ ret = -EINVAL; >+ goto err_ah; >+ } >+ > packet->msg = ib_create_send_mad(agent, > be32_to_cpu(packet->mad.hdr.qpn), > 0, rmpp_active, I think that ib_create_send_mad performs the same check. - Sean From greg at kroah.com Sat Feb 18 10:15:09 2006 From: greg at kroah.com (Greg KH) Date: Sat, 18 Feb 2006 10:15:09 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> Message-ID: <20060218181509.GA892@kroah.com> On Sat, Feb 18, 2006 at 08:32:28AM -0800, Roland Dreier wrote: > Arjan> The bigger issue is: if people can't be bothered to do > Arjan> those steps, why would they be bothered to do this for > Arjan> maintenance and bugfixes etc etc? Basically it's now > Arjan> already a de-facto unmaintained driver.... > > I don't think that's really a fair statement. The IBM people have > been active and responsive in maintaining their driving in the > openib.org svn tree. However, they asked me to post their driver for > review because it would be difficult for them to do it. Checking stuff into a private svn tree is vastly different from posting to lkml in public. 
In fact, it looks like the svn tree is so far ahead of the in-kernel stuff, that most people are just using it instead of the in-kernel code. I know at least one company has asked a distro to just accept the svn snapshot over the in-kernel IB code, which makes me wonder if the in-kernel stuff is even useful to people? Why have it, if companies insist on using the out-of-tree stuff instead? thanks, greg k-h From hch at infradead.org Sat Feb 18 10:19:32 2006 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 18 Feb 2006 18:19:32 +0000 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218181509.GA892@kroah.com> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> Message-ID: <20060218181932.GA6410@infradead.org> On Sat, Feb 18, 2006 at 10:15:09AM -0800, Greg KH wrote: > On Sat, Feb 18, 2006 at 08:32:28AM -0800, Roland Dreier wrote: > > Arjan> The bigger issue is: if people can't be bothered to do > > Arjan> those steps, why would they be bothered to do this for > > Arjan> maintenance and bugfixes etc etc? Basically it's now > > Arjan> already a de-facto unmaintained driver.... > > > > I don't think that's really a fair statement. The IBM people have > > been active and responsive in maintaining their driving in the > > openib.org svn tree. However, they asked me to post their driver for > > review because it would be difficult for them to do it. > > Checking stuff into a private svn tree is vastly different from posting > to lkml in public. In fact, it looks like the svn tree is so far ahead > of the in-kernel stuff, that most people are just using it instead of > the in-kernel code. 
> > I know at least one company has asked a distro to just accept the svn > snapshot over the in-kernel IB code, which makes me wonder if the > in-kernel stuff is even useful to people? Why have it, if companies > insist on using the out-of-tree stuff instead? The openib tree isn't private. It's mostly just a staging area for development. Any company that wants it included into a distro release is completely clueless. From rdreier at cisco.com Sat Feb 18 10:52:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 18 Feb 2006 10:52:58 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218181509.GA892@kroah.com> (Greg KH's message of "Sat, 18 Feb 2006 10:15:09 -0800") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> Message-ID: Greg> Checking stuff into a private svn tree is vastly different Greg> from posting to lkml in public. In fact, it looks like the Greg> svn tree is so far ahead of the in-kernel stuff, that most Greg> people are just using it instead of the in-kernel code. It's not a private svn tree -- the IBM ehca development is available to anyone via svn at https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/ehca Greg> I know at least one company has asked a distro to just Greg> accept the svn snapshot over the in-kernel IB code, which Greg> makes me wonder if the in-kernel stuff is even useful to Greg> people? Why have it, if companies insist on using the Greg> out-of-tree stuff instead? 
The IB driver stack is still in its early stages, so although I'm pushing for things to be merged as fast as possible, the unfortunate fact is that lots of things that people want to use (including the IBM ehca driver) are not upstream and are not ready to go upstream yet. But that doesn't mean we should give up on merging them. Distro politics are just distro politics -- and there will always be pressure on distros to ship stuff that's not upstream yet. - R. From greg at kroah.com Sat Feb 18 11:53:27 2006 From: greg at kroah.com (Greg KH) Date: Sat, 18 Feb 2006 11:53:27 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> Message-ID: <20060218195327.GA1382@kroah.com> On Sat, Feb 18, 2006 at 10:52:58AM -0800, Roland Dreier wrote: > Greg> Checking stuff into a private svn tree is vastly different > Greg> from posting to lkml in public. In fact, it looks like the > Greg> svn tree is so far ahead of the in-kernel stuff, that most > Greg> people are just using it instead of the in-kernel code. > > It's not a private svn tree -- the IBM ehca development is available > to anyone via svn at https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/ehca Sorry, I didn't mean to say "private", but rather, "seperate". Doing kernel development in a seperate development tree from the mainline kernel is very problematic, as has been documented many times in the past. > Distro politics are just distro politics -- and there will always be > pressure on distros to ship stuff that's not upstream yet. 
Luckily the distros know better than to accept this anymore, as they have been burned too many times in the past... thanks, greg k-h From rdreier at cisco.com Sat Feb 18 13:31:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 18 Feb 2006 13:31:52 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218195327.GA1382@kroah.com> (Greg KH's message of "Sat, 18 Feb 2006 11:53:27 -0800") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005707.13620.20538.stgit@localhost.localdomain> <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> <20060218195327.GA1382@kroah.com> Message-ID: Greg> Sorry, I didn't mean to say "private", but rather, Greg> "seperate". Doing kernel development in a seperate Greg> development tree from the mainline kernel is very Greg> problematic, as has been documented many times in the past. As a general rule I agree with that. However, the openib svn tree we're talking about is not some project that is off in space never merging with the kernel; as Christoph said, it's really just a staging area for stuff that isn't ready for upstream yet.n Perhaps it would be more politically correct to use git to develop kernel code, but in the end that's really just a technical difference that shouldn't matter. Roland> Distro politics are just distro politics -- and there will Roland> always be pressure on distros to ship stuff that's not Roland> upstream yet. Greg> Luckily the distros know better than to accept this anymore, Greg> as they have been burned too many times in the past... OK, that's great. But now I don't understand your original point. 
You say there are people putting pressure on distros to ship what's in openib svn rather than the upstream kernel, but if the distros are going to ignore them, what does it matter? And this thread started with me trying to help the IBM people make progress towards merging a big chunk of that svn tree upstream. That should make you happy, right? - R. From greg at kroah.com Sat Feb 18 15:29:34 2006 From: greg at kroah.com (Greg KH) Date: Sat, 18 Feb 2006 15:29:34 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: References: <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> <20060218195327.GA1382@kroah.com> Message-ID: <20060218232934.GA2624@kroah.com> On Sat, Feb 18, 2006 at 01:31:52PM -0800, Roland Dreier wrote: > Greg> Sorry, I didn't mean to say "private", but rather, > Greg> "seperate". Doing kernel development in a seperate > Greg> development tree from the mainline kernel is very > Greg> problematic, as has been documented many times in the past. > > As a general rule I agree with that. However, the openib svn tree > we're talking about is not some project that is off in space never > merging with the kernel; as Christoph said, it's really just a staging > area for stuff that isn't ready for upstream yet.n > > Perhaps it would be more politically correct to use git to develop > kernel code, but in the end that's really just a technical difference > that shouldn't matter. Yes, that doesn't matter. But it seems that the svn tree is vastly different from the in-kernel code. So much so that some companies feel that the in-kernel stuff just isn't worth running at all. > Roland> Distro politics are just distro politics -- and there will > Roland> always be pressure on distros to ship stuff that's not > Roland> upstream yet. 
> > Greg> Luckily the distros know better than to accept this anymore, > Greg> as they have been burned too many times in the past... > > OK, that's great. But now I don't understand your original point. > You say there are people putting pressure on distros to ship what's in > openib svn rather than the upstream kernel, but if the distros are > going to ignore them, what does it matter? It takes a _lot_ of effort to ignore them, as it's very difficult to do so. Especially when companies try to play the different distros off of each other, but that's not an issue that the mainline kernel developers need to worry about :) > And this thread started with me trying to help the IBM people make > progress towards merging a big chunk of that svn tree upstream. That > should make you happy, right? Yes, that does make me happy. But it doesn't make me happy to see IBM not being able to participate in kernel development by posting and defending their own code to lkml. I thought IBM knew better than that... thanks, greg k-h From rdreier at cisco.com Sat Feb 18 16:09:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 18 Feb 2006 16:09:31 -0800 Subject: [openib-general] Re: [PATCH 02/22] Firmware interface code for IB device. In-Reply-To: <20060218232934.GA2624@kroah.com> (Greg KH's message of "Sat, 18 Feb 2006 15:29:34 -0800") References: <20060218015808.GB17653@kroah.com> <20060218122011.GE911@infradead.org> <20060218122631.GA30535@granada.merseine.nu> <1140265955.4035.19.camel@laptopd505.fenrus.org> <20060218181509.GA892@kroah.com> <20060218195327.GA1382@kroah.com> <20060218232934.GA2624@kroah.com> Message-ID: Greg> Yes, that doesn't matter. But it seems that the svn tree is Greg> vastly different from the in-kernel code. So much so that Greg> some companies feel that the in-kernel stuff just isn't Greg> worth running at all. I don't want to belabor this issue... but the svn tree is not vastly different than what's in the kernel. 
It has some things that aren't upstream yet, and which are important to some people. For example, the IBM ehca driver we're talking about, as well as the PathScale driver, SDP (sockets direct protocol), etc. It just takes time for this new code to get to the point where both the developers of the new stuff feel it's ready to be merged, and the kernel community agrees that it should be merged. Greg> Yes, that does make me happy. But it doesn't make me happy Greg> to see IBM not being able to participate in kernel Greg> development by posting and defending their own code to lkml. Greg> I thought IBM knew better than that... Agreed. But let's not get sidetracked on that internal IBM issue. The ehca developers have assured me that they can and will participate in the thread reviewing their driver. It seems like it's better for me to help them work around their internal problems by acting as a proxy, than for me to delay merging their driver just because someone in IBM management is clueless. - R. From dotanb at mellanox.co.il Sat Feb 18 23:00:36 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 19 Feb 2006 09:00:36 +0200 Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4A98@mtlexch01.mtl.com> -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Ian Jiang Sent: Friday, February 17, 2006 9:20 AM To: openib-general Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty To get familiar to the IBGD-1.8.0 VAPI, I wrote a program very simple, according to two examples *hca_per* and *rctp* in IBGD. A Sender and a Receiver ran on tow different nodes just to complete a Send/Recv progress. Sender ====== (a) Create IB resources: 1. List HCAs (only one HCA in fact) 2. Get the handle of the HCA 3. Query the HCA 4. Allocate a PD 5. Quey Port 1 of the HCA (only one Port in fact) 6. Create Send CQ and Recv CQ 7. 
Create QP (b) Modify QP to INIT state: 1. qp_move_to_init(&params); (c) Create MRs for Recv and Send respectively: 1. user_mr_create(&params.in_mr, params.mr_sz_req); 2. user_mr_create(&params.out_mr, params.mr_sz_req); (d) Send parameters to Receiver (e) Get ready to transfer: 1. Modify QP to RTR state 2. Modify QP to RTS state (f) Post Send 1. post_send_req(&params.ib_res, &params.out_mr) (g) Wait Send to complete 1. reap_send_req(&params.ib_res, &params.out_mr, 1/* not block*/); Receiver ======= (a) Wait parameters from Sender (b) Create IB resources: 1. List HCAs (only one HCA in fact) 2. Get the handle of the HCA 3. Query the HCA 4. Allocate a PD 5. Query Port 1 of the HCA (only one Port in fact) 6. Create Send CQ and Recv CQ 7. Create QP (b) Modify QP to INIT state: 1. qp_move_to_init(&params); (c) Create MRs for Recv and Send respectively: 1. user_mr_create(&params.in_mr, params.mr_sz_req); 2. user_mr_create(&params.out_mr, params.mr_sz_req); (d) Post Recv 1. post_recv_req(&params.ib_res, &params.in_mr) (e) Get ready to transfer: 1. Modify QP to RTR state 2. Modify QP to RTS state (g) Wait Recv to complete 1. reap_recv_req(&params.ib_res, &params.in_mr, 1/* not block*/); Problem: ======= Both VAPI_poll_cq for Send CQ and Recv CQ returned "CQ is empty". And I failed to find out where the problem was, so turned to OpenIB for help. I am afraid that I am not clear enough about the CQ processing. Any suggestion is appreciated!
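For the wait steps (g) above, an empty CQ is an ordinary poll result, so the loop needs to tell "nothing yet" apart from "completion dequeued" before any status field is read. A minimal sketch of such a bounded poll loop follows. It deliberately does not use the VAPI calls: enum poll_ret, struct completion, poll_fn, poll_bounded and the toy fake_cq are all hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes standing in for "ok", "CQ empty", "error". */
enum poll_ret { POLL_OK, POLL_EMPTY, POLL_ERROR };

/* Hypothetical completion record; only 'status' matters here. */
struct completion {
    int status;
};

/* Hypothetical poll callback standing in for a real CQ poll call. */
typedef enum poll_ret (*poll_fn)(void *cq, struct completion *wc);

/* Poll at most max_tries times. Returns POLL_OK only when a completion
 * was actually dequeued; *wc is meaningful only in that case. */
enum poll_ret poll_bounded(poll_fn poll, void *cq,
                           struct completion *wc, int max_tries)
{
    int i;

    assert(poll != NULL && wc != NULL);
    for (i = 0; i < max_tries; i++) {
        enum poll_ret r = poll(cq, wc);

        if (r != POLL_EMPTY)
            return r; /* completion dequeued, or a real error */
    }
    return POLL_EMPTY; /* still nothing: do not look at *wc */
}

/* Toy queue for demonstration: reports empty for 'empty_polls' calls,
 * then yields one completion carrying 'status'. */
struct fake_cq {
    int empty_polls;
    int status;
};

enum poll_ret fake_poll(void *cq, struct completion *wc)
{
    struct fake_cq *q = cq;

    if (q->empty_polls > 0) {
        q->empty_polls--;
        return POLL_EMPTY;
    }
    wc->status = q->status;
    return POLL_OK;
}
```

A caller would check the return value first and read wc.status only when it is POLL_OK; when the loop gives up with POLL_EMPTY the completion record is untouched.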
Here are some pieces fo codes: ========================= /*********************************** Create IB Resources ****************************************/ int ib_res_create(struct ib_resource *ib_res_p) { VAPI_ret_t vapi_ret; u_int32_t num_of_hcas; VAPI_hca_id_t inst_hca_id; VAPI_cqe_num_t num_of_cqe; VAPI_srq_attr_t srq_props; VAPI_srq_attr_t actual_srq_props; VAPI_qp_init_attr_t qp_init_attr; VAPI_qp_init_attr_ext_t qp_ext_attr; VAPI_qp_prop_t qp_prop; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } ini_ib_res(ib_res_p); /* list HCAs */ vapi_ret = EVAPI_list_hcas(1, &num_of_hcas, &inst_hca_id); if ((vapi_ret != VAPI_OK) && (vapi_ret != VAPI_EAGAIN)) { printf("list HCAs failed\n"); VAPIERR(vapi_ret); return -1; } PRINT_TRACE("number of HCAs: %d, HCA ID: %s\n", num_of_hcas, (char *)inst_hca_id); switch(num_of_hcas) { case 0: printf("No HCAs installed\n"); return -1; case 1: strcpy(ib_res_p->hca_id, inst_hca_id); break; default: /* ToDo: deal with multiple HCAs */ printf("ToDo: deal with multiple HCAs\n"); printf("Use the first HCA\n"); strcpy(ib_res_p->hca_id, inst_hca_id); } PRINT_TRACE("HCA to be used: %s\n", (char *)ib_res_p->hca_id); /* get the handle of the HCA */ vapi_ret = EVAPI_get_hca_hndl(ib_res_p->hca_id, &ib_res_p->hca_hndl); if (vapi_ret != VAPI_OK) { printf("HCA not open\n"); VAPIERR(vapi_ret); goto clean_exit; } /* query the HCA */ vapi_ret = VAPI_query_hca_cap(ib_res_p->hca_hndl, &ib_res_p->hca_vendor, &ib_res_p->hca_cap); if (vapi_ret != VAPI_OK) { printf("Query HCA failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_HCA_CAP(&ib_res_p->hca_vendor, &ib_res_p->hca_cap); /* allocate PD */ //vapi_ret = EVAPI_alloc_pd(ib_res_p->hca_hndl, MAX_NUM_AVS, &ib_res_p->pd_hndl); vapi_ret = VAPI_alloc_pd(ib_res_p->hca_hndl, &ib_res_p->pd_hndl); if (vapi_ret != VAPI_OK) { printf("Allocate PA failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("PD allocated: %ld\n", ib_res_p->pd_hndl); /* query Port */ vapi_ret = 
VAPI_query_hca_port_prop(ib_res_p->hca_hndl, DEFAULT_PORT_NUM, &ib_res_p->hca_port); if (vapi_ret != VAPI_OK) { printf("Query Port %d failed\n", DEFAULT_PORT_NUM); VAPIERR(vapi_ret); goto clean_exit; } PRINT_PORT_PROP(&ib_res_p->hca_port); /* send CQ */ vapi_ret = VAPI_create_cq(ib_res_p->hca_hndl, MIN_SEND_CQE_NUM, &ib_res_p->s_cq_hndl, &num_of_cqe); if (vapi_ret != VAPI_OK) { printf("Create CQ for send failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("CQ for send created. CQE NUM: %d\n", num_of_cqe); /* receive CQ */ vapi_ret = VAPI_create_cq(ib_res_p->hca_hndl, MIN_SEND_CQE_NUM, &ib_res_p->r_cq_hndl, &num_of_cqe); if (vapi_ret != VAPI_OK) { printf("Create CQ for send failed\n"); VAPIERR(vapi_ret); goto clean_exit; } PRINT_TRACE("CQ for receive created. CQE NUM: %d\n", num_of_cqe); /* QP */ qp_init_attr.rq_cq_hndl = ib_res_p->r_cq_hndl; qp_init_attr.sq_cq_hndl = ib_res_p->s_cq_hndl; qp_init_attr.cap.max_oust_wr_rq = QP_INI_MAX_OUST_WR_RQ_NUM; qp_init_attr.cap.max_oust_wr_sq = QP_INI_MAX_OUST_WR_SQ_NUM; qp_init_attr.cap.max_sg_size_rq = QP_INI_MAX_SG_SIZE_RQ_NUM; qp_init_attr.cap.max_sg_size_sq = QP_INI_MAX_SG_SIZE_SQ_NUM; qp_init_attr.pd_hndl = ib_res_p->pd_hndl; qp_init_attr.rdd_hndl = 0; qp_init_attr.sq_sig_type = VAPI_SIGNAL_REQ_WR; qp_init_attr.rq_sig_type = VAPI_SIGNAL_ALL_WR; qp_init_attr.ts_type = VAPI_TS_RC; vapi_ret = VAPI_create_qp_ext(ib_res_p->hca_hndl, &qp_init_attr, &qp_ext_attr, &ib_res_p->qp_entry.qp_hndl, &qp_prop); if (vapi_ret != VAPI_OK) { printf("Create QP failed\n"); VAPIERR(vapi_ret); goto clean_exit; } ib_res_p->qp_entry.qp_num = qp_prop.qp_num; ib_res_p->qp_entry.srq_hndl = ib_res_p->srq_hndl; PRINT_TRACE("QP created\n"); PRINT_QP_PROP(&qp_prop); return 0; clean_exit: clean_ib_res(ib_res_p); return -1; } /*****************************Modify QP state ***************************************/ int qp_move_to_init(test_params_t *param_p) { VAPI_qp_attr_mask_t qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t 
res; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_INIT; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_QP_STATE); qp_attr.pkey_ix = 0; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_PKEY_IX); qp_attr.port = DEFAULT_PORT_NUM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_PORT); qp_attr.remote_atomic_flags = VAPI_EN_REM_WRITE | VAPI_EN_REM_READ | VAPI_EN_REM_ATOMIC_OP; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_REMOTE_ATOMIC_FLAGS); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != VAPI_OK) { printf("Error: Modifying QP to INIT: %s\n", VAPI_strerror(res)); return -1; } PRINT_TRACE("Modified QP to INIT\n"); print_qp_cap(&qp_cap); return 0; } int qp_move_to_rtr(test_params_t *param_p) { VAPI_qp_attr_mask_t qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t res; param_p->mtu = (param_p->ib_res.hca_vendor.vendor_part_id == 23108) ? MTU1024 : MTU2048; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_RTR; QP_ATTR_MASK_SET(qp_attr_mask, QP_ATTR_QP_STATE); qp_attr.av.sl = 0; /*USED_SL*/ qp_attr.av.grh_flag = FALSE; qp_attr.av.dlid = param_p->dst_msg.lid; qp_attr.av.static_rate = 2; /* 1x */ qp_attr.av.src_path_bits = 0; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_AV); qp_attr.path_mtu = param_p->mtu; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_PATH_MTU); qp_attr.rq_psn = START_PSN; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RQ_PSN); qp_attr.qp_ous_rd_atom = QP_OUS_RD_ATOM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_QP_OUS_RD_ATOM); qp_attr.dest_qp_num = param_p->dst_msg.qp_num; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_DEST_QP_NUM); qp_attr.min_rnr_timer = 0; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_MIN_RNR_TIMER); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != VAPI_OK) { printf("Error: Modifying QP to RTR: %s\n", VAPI_strerror(res)); return -1/*(RET_ERR)*/; } PRINT_TRACE("Modified QP to RTR\n"); print_qp_cap(&qp_cap); return 0; } 
int qp_move_to_rts(test_params_t *param_p) { VAPI_qp_attr_mask_t qp_attr_mask; VAPI_qp_attr_t qp_attr; VAPI_qp_cap_t qp_cap; VAPI_ret_t res; QP_ATTR_MASK_CLR_ALL(qp_attr_mask); qp_attr.qp_state = VAPI_RTS; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_QP_STATE); qp_attr.sq_psn = START_PSN; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_SQ_PSN); qp_attr.timeout = 18; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_TIMEOUT); qp_attr.retry_count = 6; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RETRY_COUNT); qp_attr.rnr_retry = 6; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_RNR_RETRY); qp_attr.ous_dst_rd_atom = QP_OUS_RD_ATOM; QP_ATTR_MASK_SET(qp_attr_mask,QP_ATTR_OUS_DST_RD_ATOM); res = VAPI_modify_qp(param_p->ib_res.hca_hndl, param_p->ib_res.qp_entry.qp_hndl, &qp_attr, &qp_attr_mask, &qp_cap); if (res != VAPI_OK) { printf("Error: Modifying QP to RTS: %s\n", VAPI_strerror(res)); return /*(RET_ERR)*/-1; } PRINT_TRACE("Modified QP to RTS\n"); print_qp_cap(&qp_cap); return 0; } /************************** Recv/Send requests ******************************/ /* * post receive request */ int post_recv_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p) { VAPI_ret_t res; VAPI_rr_desc_t rr; VAPI_sg_lst_entry_t sg_entry_r; VAPI_hca_hndl_t hca_hndl; VAPI_qp_hndl_t qp_hndl; VAPI_srq_hndl_t srq_hndl; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; qp_hndl = ib_res_p->qp_entry.qp_hndl; hca_hndl = ib_res_p->srq_hndl; rr.opcode = VAPI_RECEIVE; rr.comp_type = VAPI_SIGNALED; rr.sg_lst_len = 1; sg_entry_r.lkey = u_mr_p->mrw_rep.l_key; sg_entry_r.len = u_mr_p->mrw_req.size; sg_entry_r.addr = (VAPI_virt_addr_t)(MT_virt_addr_t)u_mr_p->user_buf; rr.sg_lst_p = &sg_entry_r; rr.id = sg_entry_r.addr; PRINT_RECV_REQ(&rr); res = VAPI_post_rr(hca_hndl, qp_hndl, &rr); if (res != VAPI_OK) { printf("VAPI post Recv Req failed\n"); VAPIERR(res); return -1; } return 0; } /* * post send request */ int 
post_send_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p) { VAPI_ret_t res; VAPI_sr_desc_t sr; VAPI_sg_lst_entry_t sg_entry_s; VAPI_hca_hndl_t hca_hndl; VAPI_qp_hndl_t qp_hndl; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; qp_hndl = ib_res_p->qp_entry.qp_hndl; sr.comp_type = VAPI_SIGNALED; sr.set_se = FALSE; sr.opcode = VAPI_SEND; sr.remote_qkey = 0; sr.sg_lst_len = 1; sg_entry_s.lkey = u_mr_p->mrw_rep.l_key; sg_entry_s.len = u_mr_p->mrw_req.size; sg_entry_s.addr = (VAPI_virt_addr_t)(MT_virt_addr_t)u_mr_p->user_buf; sr.sg_lst_p = &sg_entry_s; sr.id = sg_entry_s.addr; PRINT_SEND_REQ(&sr); res = VAPI_post_sr(hca_hndl, qp_hndl, &sr); if (res != VAPI_OK) { printf("VAPI post Send Req failed\n"); VAPIERR(res); return -1; } return 0; } int reap_send_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p, int block) { VAPI_ret_t res; VAPI_wc_desc_t wc_desc; VAPI_hca_hndl_t hca_hndl; VAPI_cq_hndl_t s_cq_hndl; int poll_cnt = 0; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; s_cq_hndl = ib_res_p->s_cq_hndl; if (block) { do { poll_cnt++; MTPERF_TIME_START(VAPI_poll_cq); res = VAPI_poll_cq(hca_hndl, s_cq_hndl, &wc_desc); //res = EVAPI_poll_cq_block(hca_hndl, s_cq_hndl, REAP_REQ_WAIT_TIME, &wc_desc); MTPERF_TIME_END(VAPI_poll_cq); if (res != VAPI_OK && res != VAPI_CQ_EMPTY) { PRINT_ERR("Poll CQ block failed\n"); VAPIERR(res); return -1; } show_qp_state(hca_hndl, ib_res_p->qp_entry.qp_hndl, ib_res_p->qp_entry.qp_num); VAPI_RET(res); } while(res == VAPI_CQ_EMPTY && poll_cnt < 10); if (wc_desc.status != VAPI_SUCCESS) { PRINT_ERR("Req unsuccess: %s\n", VAPI_wc_status_sym(wc_desc.status)); PRINT_WC_DESC(&wc_desc); return -1; } } else { printf("ToDo: %s for unblock\n", __func__); } PRINT_TRACE("Req success\n"); 
PRINT_WC_DESC(&wc_desc); return 0; } int reap_recv_req(struct ib_resource *ib_res_p, struct user_mr *u_mr_p, int block) { VAPI_ret_t res; VAPI_wc_desc_t wc_desc; VAPI_hca_hndl_t hca_hndl; VAPI_cq_hndl_t r_cq_hndl; int poll_cnt = 0; if (ib_res_p == NULL) { PRINT_ERR("NULL ib_res_p\n"); return -1; } if (u_mr_p == NULL) { PRINT_ERR("NULL user mr pointer\n"); return -1; } hca_hndl = ib_res_p->hca_hndl; r_cq_hndl = ib_res_p->r_cq_hndl; if (block) { do { poll_cnt++; res = VAPI_poll_cq(hca_hndl, r_cq_hndl,&wc_desc); if (res != VAPI_OK && res != VAPI_CQ_EMPTY) { PRINT_ERR("Poll CQ block failed\n"); VAPIERR(res); return -1; } sleep(1); } while(res == VAPI_CQ_EMPTY && poll_cnt < 20); if (wc_desc.status != VAPI_SUCCESS) { PRINT_ERR("Req failed: %s\n", VAPI_wc_status_sym(wc_desc.status)); PRINT_WC_DESC(&wc_desc); return -1; } } else { printf("ToDo: %s for unblock\n", __func__); } PRINT_TRACE("Req success\n"); PRINT_WC_DESC(&wc_desc); return 0; } -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at mellanox.co.il Sat Feb 18 23:12:09 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 19 Feb 2006 09:12:09 +0200 Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4AA9@mtlexch01.mtl.com> Sorry, about the previous email, it was sent by mistake ... I believe that the problem is that the min_rnr_timer value is 0 (which means infinite timeout between the rnr retries) and there is rnr nak between the two sides (because you don't sync between the sides, and this is the reason for the empty CQ ... 
Let me describe the problem: The sender sent a send message which should consume RR (Receive Request) at the receiver side, but when the message have reached to the receiver there wasn't any RR in the RQ, so he sent to the sender rnr-nack, the sender got the rnr-nack and is waiting the min_rnr_timer which is infinite ... You should do the following things: * Put a non zero value in the min_rnr_timer (you may get completion with bad status: rnr exceeded if the receiver won't be ready in time ...) * Post RR in the responder in the init state * Optional: sync between the sides (post SR at the sender only when there is RR in the receiver side). Dotan -------------- next part -------------- An HTML attachment was scrubbed... URL: From ogerlitz at voltaire.com Sun Feb 19 01:08:52 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 19 Feb 2006 11:08:52 +0200 (IST) Subject: [openib-general] [PATCH] iser: fixed sparse warnings Message-ID: ------------------------------------------------------------------------ r5442 | ogerlitz | 2006-02-19 11:05:17 +0200 (Sun, 19 Feb 2006) | 4 lines fixed sparse warnings Signed-off-by: Or Gerlitz Index: iser_socket.c =================================================================== --- iser_socket.c (revision 5427) +++ iser_socket.c (revision 5442) @@ -45,7 +45,7 @@ static int iser_sock_create(struct socke static int iser_sock_release(struct socket *); static int iser_sock_connect(struct socket *, struct sockaddr *, int, int); static int iser_sock_shutdown(struct socket *,int); -static int iser_sock_getsockopt(struct socket *,int,int,char *,int *); +static int iser_sock_getsockopt(struct socket *,int,int,char __user *,int __user *); static unsigned int iser_sock_poll(struct file *,struct socket *, struct poll_table_struct *); @@ -55,40 +55,40 @@ struct iser_sock { }; static struct net_proto_family iser_proto_family = { - family: PF_ISER, - create: iser_sock_create, - authentication: 0, - encryption: 0, - encrypt_net: 0, + .family 
= PF_ISER, + .create = iser_sock_create, + .authentication = 0, + .encryption = 0, + .encrypt_net = 0 }; static struct proto_ops iser_proto_ops = { - family: AF_ISER, - owner: THIS_MODULE, + .family = AF_ISER, + .owner = THIS_MODULE, - connect: iser_sock_connect, - release: iser_sock_release, - shutdown: iser_sock_shutdown, - - bind: sock_no_bind, - poll: iser_sock_poll, - socketpair: sock_no_socketpair, - accept: sock_no_accept, - getname: sock_no_getname, - ioctl: sock_no_ioctl, - listen: sock_no_listen, - setsockopt: sock_setsockopt, - getsockopt: iser_sock_getsockopt, - sendmsg: sock_no_sendmsg, - recvmsg: sock_no_recvmsg, - mmap: sock_no_mmap, - sendpage: sock_no_sendpage, + .connect = iser_sock_connect, + .release = iser_sock_release, + .shutdown = iser_sock_shutdown, + + .bind = sock_no_bind, + .poll = iser_sock_poll, + .socketpair = sock_no_socketpair, + .accept = sock_no_accept, + .getname = sock_no_getname, + .ioctl = sock_no_ioctl, + .listen = sock_no_listen, + .setsockopt = sock_setsockopt, + .getsockopt = iser_sock_getsockopt, + .sendmsg = sock_no_sendmsg, + .recvmsg = sock_no_recvmsg, + .mmap = sock_no_mmap, + .sendpage = sock_no_sendpage }; static struct proto iser_sock_proto = { - name: "ib_iser", - owner: THIS_MODULE, - obj_size: sizeof(struct iser_sock), + .name = "ib_iser", + .owner = THIS_MODULE, + .obj_size = sizeof(struct iser_sock) }; struct iser_conn *iser_conn_from_sock(struct socket *sock) @@ -202,7 +202,7 @@ int iser_sock_shutdown(struct socket *so } static int iser_sock_getsockopt(struct socket *sock, int level, int optname, - char *optval, int *optlen) + char __user *optval, int __user *optlen) { return 0; } From jackm at mellanox.co.il Sun Feb 19 02:42:59 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 19 Feb 2006 12:42:59 +0200 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4BCC@mtlexch01.mtl.com> Thanks for agreeing to check in the patch! 
Feedback below (see my inserted comments bracketed by >>>>>>>>>>>>>> / <<<<<<<<<<<< and prefixed by "[JPM]"). Jack -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Saturday, February 18, 2006 7:40 PM To: Jack Morgenstein Cc: Michael S. Tsirkin; openib-general at openib.org Subject: RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 >+static inline void adjust_last_ack(struct ib_mad_send_wr_private *wr) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ >+ if (wr->last_ack < 2) >+ return; >+ else if (!wr->last_ack_seg) >+ list_for_each_entry(seg, &wr->multipacket_list, list) { >+ if (wr->last_ack == seg->num) { >+ wr->last_ack_seg = seg; >+ break; >+ } >+ } >+ else >+ list_for_each_entry(seg, &wr->last_ack_seg->list, list) { >+ if (wr->last_ack == seg->num) { >+ wr->last_ack_seg = seg; >+ break; >+ } >+ } >+} If we initialize last_ack_seg to the start of the list, can we combine the else if and else checks together? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM] This is a problem, since we will then always skip over segment 2 (list_for_each_entry starts with taking the next segment). If we initialize so that the last_seg scan will start with the first rmpp "overflow" segment (seg 2), we'll need to allocate a dummy initial segment just so that we can reference &wr->last_ack_seg->list -- seems more conplicated to me. This way, last_ack_seg gets initialized painlessly. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >@@ -647,6 +672,7 @@ static void process_rmpp_ack(struct ib_m > > if (seg_num > mad_send_wr->last_ack) { > mad_send_wr->last_ack = seg_num; >+ adjust_last_ack(mad_send_wr); If last_ack_seg references a segment that contains the seg_num, can we eliminate last_ack? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM]: No, because the seg-num ack'ed may be segment number 1 (which has no entry in the multipacket list). The last_ack_seg pointer's job is only to avoid the O(n-squared) search in scanning the segment list. 
The RMPP logic uses last_ack (see, for example abort_send() and ib_retry_rmpp() in the same file. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+static inline int alloc_send_rmpp_segs(struct ib_mad_send_wr_private *send_wr, >+ int message_size, int hdr_len, >+ int data_len, u8 rmpp_version, >+ gfp_t gfp_mask) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ struct ib_rmpp_mad *rmpp_mad = send_wr->send_buf.mad; >+ int seg_size, i = 2; >+ >+ rmpp_mad->rmpp_hdr.paylen_newwin = >+ cpu_to_be32(hdr_len - IB_MGMT_RMPP_HDR + data_len); >+ rmpp_mad->rmpp_hdr.rmpp_version = rmpp_version; >+ rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; >+ ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); >+ send_wr->total_length = message_size; >+ /* allocate RMPP buffers */ >+ message_size -= sizeof(struct ib_mad); >+ seg_size = sizeof(struct ib_mad) - hdr_len; >+ while (message_size > 0) { >+ seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + seg_size, >+ gfp_mask); It would be convenient if the MAD were cleared for the user, so we don't end of transferring random data, especially at the end of the user data. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM]: This is done in file user_mad.c, file ib_umad_write(): See patch 3 of 3, hunk starting with: @@ -502,14 +510,32 @@ static ssize_t ib_umad_write(struct file following lines: + /* Pad last segment with zeroes. */ + if (seg->size - s) + memset(seg->data + s, 0, seg->size - s); I did it this way to avoid using kzalloc for all segment allocations, or kmalloc/memset for all. This way, only the last segment is "memset"'ed and only in the padding area. We can probably get rid of the "if", and just do the padding, with possibly zero bytes to be padded. 
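A minimal userspace sketch of the padding idea described above, with the "if" dropped as suggested: memset() with a zero length does nothing, so the unconditional call is safe. struct demo_seg and demo_fill_last_seg() are hypothetical names used only for illustration; they are not the kernel's ib_mad_multipacket_seg or the code in ib_umad_write().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one RMPP segment; NOT the kernel structure. */
struct demo_seg {
	size_t size;             /* capacity of data[] in bytes */
	unsigned char data[64];
};

/*
 * Copy len bytes of user data into the (last) segment and zero the rest.
 * When len == seg->size the memset() length is zero and nothing is
 * written, so no surrounding "if" is required.
 * Returns the number of padding bytes that were zeroed.
 */
static size_t demo_fill_last_seg(struct demo_seg *seg, const void *src,
				 size_t len)
{
	assert(len <= seg->size);
	memcpy(seg->data, src, len);
	memset(seg->data + len, 0, seg->size - len);
	return seg->size - len;
}
```

Only the tail of the final segment is cleared here, which mirrors the stated goal of avoiding a kzalloc-style clear of every segment.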
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+ if (!seg) { >+ printk(KERN_ERR "ib_create_send_mad: RMPP mem " >+ "alloc failed for len %zd, gfp %#x\n", >+ sizeof(struct ib_mad_multipacket_seg) + seg_size, >+ gfp_mask); >+ free_send_multipacket_list(send_wr); >+ return -ENOMEM; >+ } >+ seg->size = seg_size; Okay, seg_size is the same for all segments belonging to a single MAD. We can move this it ib_mad_send_buf. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM]: This was done to hide segment implementation details from the user_mad layer. You are correct that we can get rid of the size field, and store it only once. However, I think that the proper place for this value is the "ib_mad_send_wr_private" structure, which also holds the list pointer. How about if we change struct ib_mad_multipacket_seg *ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num) to also return the segment size: struct ib_mad_multipacket_seg *ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num, int *seg_size) ? The mad layer (which has access to the private structure) can then return the segment size still without exposing implementation details. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+ mad_send_wr->last_ack_seg = NULL; >+ mad_send_wr->seg_num_seg = NULL; Mad_send_wr is cleared on allocation. Are there better values to initialize these variables to? The first/second segment? >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM] IMHO, it is cleaner initialize to NULL. However, as you point out, these two initialization lines are redundant. Furthermore, there may be no segment to initialize to (if this is a non-rmpp MAD, or a single-segment RMPP mad). 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+struct ib_mad_multipacket_seg >+*ib_rmpp_get_multipacket_seg(struct ib_mad_send_wr_private *wr, int seg_num) >+{ >+ struct ib_mad_multipacket_seg *seg; >+ >+ if (seg_num == 2) { >+ wr->seg_num_seg = >+ container_of(wr->multipacket_list.next, >+ struct ib_mad_multipacket_seg, list); >+ return wr->seg_num_seg; >+ } >+ >+ /* get first list entry if was not already done */ >+ if (!wr->seg_num_seg) See previous comment. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM]: On examination, can just get rid of the following lines: + if (seg_num == 2) { + wr->seg_num_seg = + container_of(wr->multipacket_list.next, + struct ib_mad_multipacket_seg, list); + return wr->seg_num_seg; + } + (i.e., just initialize if seg_num_seg is null. If seg_num = 2, the patch lines below will handle this case: + if (wr->seg_num_seg->num == seg_num) + return wr->seg_num_seg; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+struct ib_mad_multipacket_seg >+*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num) >+{ >+ struct ib_mad_send_wr_private *wr; >+ >+ if (seg_num < 2) >+ return NULL; >+ >+ wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); >+ return ib_rmpp_get_multipacket_seg(wr, seg_num); >+} Treating the first segment special seems somewhat confusing. (Maybe this is a result of how the MAD/RMPP header is copied down from userspace?) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM] Its confusing either way. At least this way we don't allocate an extra (first) segment when using RMPP. The initial 256 bytes of MAD get allocated anyway: In procedure ib_create_send_mad(), you will note that for all MADS, the initial allocate (line 851) is for the ib_mad_send_wr_private element plus a MAD packet (256 bytes). Later, if needed for RMPP, additional segments are allocated. 
To treat regular MADs one way, and RMPP fully another way, the size of the initial allocation would need to be conditional on whether this is RMPP or not -- which complicates the code. <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >+ if (!rmpp_active && length > sizeof(struct ib_mad)) { >+ ret = -EINVAL; >+ goto err_ah; >+ } >+ > packet->msg = ib_create_send_mad(agent, > be32_to_cpu(packet->mad.hdr.qpn), > 0, rmpp_active, I think that ib_create_send_mad performs the same check. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [JPM]: You are correct. In fact, the test just before this one is also performed in ib_create_send_mad(), and may also be deleted (this was not part of the patch): if (rmpp_active && !agent->rmpp_version) { ret = -EINVAL; goto err_ah; } <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< - Sean From hch at lst.de Sun Feb 19 04:01:55 2006 From: hch at lst.de (Christoph Hellwig) Date: Sun, 19 Feb 2006 13:01:55 +0100 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> Message-ID: <20060219120155.GA10268@lst.de> On Wed, Feb 08, 2006 at 03:58:56PM -0500, Talpey, Thomas wrote: > We have released an updated NFS/RDMA client for Linux at > the project's Sourceforge site: Thanks, this looks much better than the previous patch. 
Comments: - please don't build the rdma transport unconditionally, but make it a user-visible config option - please use the kernel u*/s* types instead of (u)int*_t - please include your local headers after the headers, and keep all the includes at the beginning of the files, just after the licence comment block - chunktype shouldn't be a typedef but a pure enum, and the names look a bit too generic, please add an rdma_ prefix - please kill the XDR_TARGET and pos0 macros, maybe RPC_SEND_SEG0 and RPC_SEND_LEN0, too - RPC_SEND_VECS should become an inline function and be spelled lowercase - RPC_SEND_COPY is probably too large to be inlined and should be spelled lowercase - the CONFIG_HIGHMEM ifdef block in RPC_SEND_COPY is wrong. Please always use kmap, it does the right thing for non-highmem as well. The PageHighMem check and using kmap_high directly is always wrong, they are internal implementation details. I'd also suggest evaluating kmap_atomic because it scales much better on SMP systems. - RPC_RECV_VECS should be an inline and spelled lowercase - RPC_RECV_SEG0 and RPC_RECV_LEN0 should probably go away. - RPC_RECV_COPY is probably too large to be inlined and should be spelled lowercase - RPC_RECV_COPY same comment about highmem and kmap as in RPC_SEND_COPY - please try to avoid file-scope forward-prototypes but try to order the code in the natural flow where they aren't required - structures like rpcrdma_msg that are on the wire should use __be* for endianness annotations, and the cpu_to_be*/be*_to_cpu accessor functions instead of hton?/ntoh?. Please verify that these annotations are correct using sparse -D__CHECK_ENDIAN__=1 - rdma_convert_physiov/rdma_convert_phys are completely broken. page_to_phys can't be used by driver/fs code. RDMA only deals with bus addresses, not physical addresses. You must use the dma mapping API instead. 
Also coalescing decisions are made by the dma layer, because they are platform dependent and much more complex than what the code in this patch does. - transport.c is missing a GPL license statement - in transport.c please don't use CamelCase variable names. - MODULE_PARM shouldn't be used in new code, but module_param instead. - please don't use the (void) function() style, it just obfuscates the code without benefit. - try_module_get(THIS_MODULE) is always wrong. Reference counting should happen from the calling module. - please initialize global or file-scope spinlocks with DEFINE_SPINLOCK(). - the traditional name for the second argument to spin_lock_irqsave is just flags, not lock_flags. This doesn't really matter, but following such conventions makes it easier to understand the code for kernel hackers that just occasionally drop into your code. - no need to cast the return value from kmalloc/kzalloc/etc. They return void * which can be directly assigned to any pointer type. - please avoid typedefs for structure types, like struct rdma_ia, struct rdma_ep, etc. From yael at mellanox.co.il Sun Feb 19 04:24:31 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 19 Feb 2006 14:24:31 +0200 Subject: [openib-general] Re:[PATCH] OpenSM/st.c: Fix some size_t issues related to memoryallocation in st.c Message-ID: <5zpsljk03k.fsf@mtl066.yok.mtl.com> Hi Hal, The patch in general is fine. I've added one change to the original patch - to avoid casting issues under Windows. Below is the full patch. 
Yael Signed-off-by: Hal Rosenstock Index: include/opensm/st.h =================================================================== --- include/opensm/st.h (revision 5436) +++ include/opensm/st.h (working copy) @@ -40,6 +40,8 @@ #ifndef ST_INCLUDED #define ST_INCLUDED +#include + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -79,11 +81,11 @@ struct st_table { enum st_retval {ST_CONTINUE, ST_STOP, ST_DELETE}; st_table *st_init_table(struct st_hash_type *); -st_table *st_init_table_with_size(struct st_hash_type *, int); +st_table *st_init_table_with_size(struct st_hash_type *, size_t); st_table *st_init_numtable(void); -st_table *st_init_numtable_with_size(int); +st_table *st_init_numtable_with_size(size_t); st_table *st_init_strtable(void); -st_table *st_init_strtable_with_size(int); +st_table *st_init_strtable_with_size(size_t); int st_delete(st_table *, st_data_t *, st_data_t *); int st_delete_safe(st_table *, st_data_t *, st_data_t *, st_data_t); int st_insert(st_table *, st_data_t, st_data_t); Index: opensm/st.c =================================================================== --- opensm/st.c (revision 5436) +++ opensm/st.c (working copy) @@ -42,7 +42,6 @@ #endif /* HAVE_CONFIG_H */ #include -#include #include #include @@ -102,17 +101,11 @@ static struct st_hash_type type_strhash #define xcalloc calloc #define xrealloc realloc #define xfree free -#if 0 -void *xmalloc(long); -void *xcalloc(long, long); -void *xrealloc(void *, long); -void xfree(void *); -#endif static void rehash(st_table *); -#define alloc(type) (type*)xmalloc((unsigned)sizeof(type)) -#define Calloc(n,s) (char*)xcalloc((n),(s)) +#define alloc(type) (type*)xmalloc(sizeof(type)) +#define Calloc(n,s) (char*)xcalloc((n), (s)) #define EQUAL(table,x,y) ((x)==(y) || (*table->type->compare)(((void*)x),((void *)y)) == 0) @@ -200,7 +193,7 @@ stat_col() st_table* st_init_table_with_size(type, size) struct st_hash_type *type; - int size; + size_t size; { st_table *tbl; @@ 
-238,7 +231,7 @@ st_init_numtable(void) st_table* st_init_numtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_numhash, size); } @@ -251,7 +244,7 @@ st_init_strtable(void) st_table* st_init_strtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_strhash, size); } @@ -314,7 +307,8 @@ st_lookup(table, key, value) return 0; } else { - if (value != 0) *value = ptr->record; + if (value != 0) + *value = ptr->record; return 1; } } @@ -407,7 +401,8 @@ st_copy(old_table) { st_table *new_table; st_table_entry *ptr, *entry; - int i, num_bins = old_table->num_bins; + int i; + size_t num_bins = old_table->num_bins; new_table = alloc(st_table); if (new_table == 0) @@ -417,7 +412,7 @@ st_copy(old_table) *new_table = *old_table; new_table->bins = (st_table_entry**) - Calloc((unsigned)num_bins, sizeof(st_table_entry*)); + Calloc(num_bins, sizeof(st_table_entry*)); if (new_table->bins == 0) { @@ -524,7 +519,7 @@ st_delete_safe(table, key, value, never) } static int -delete_never( st_data_t key, st_data_t value, st_data_t never) +delete_never(st_data_t key, st_data_t value, st_data_t never) { if (value == never) return ST_DELETE; return ST_CONTINUE; From ogerlitz at voltaire.com Sun Feb 19 04:39:24 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 19 Feb 2006 14:39:24 +0200 (IST) Subject: [openib-general] [PATCH] mutex-backport: add mutex_trylock support Message-ID: I've allowed myself to it commit in r5443 added mutex_trylock() to backport hacks Signed-off-by: Or Gerlitz Index: mutex-backport.h =================================================================== --- mutex-backport.h (revision 5225) +++ mutex-backport.h (revision 5443) @@ -10,6 +10,9 @@ #define mutex_init(foo) init_MUTEX(foo) #define mutex_lock(foo) down(foo) #define mutex_lock_interruptible(foo) down_interruptible(foo) +/* this function follows the spin_trylock() convention, so * + * it is negated to the down_trylock() return values! 
Be careful */ +#define mutex_trylock(foo) !down_trylock(foo) #define mutex_unlock(foo) up(foo) #endif /* __LINUX_MUTEX_H */ From mst at mellanox.co.il Sun Feb 19 04:45:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Feb 2006 14:45:53 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: References: Message-ID: <20060219124553.GA22037@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: [PATCH] mutex-backport: add mutex_trylock support > > I've allowed myself to it commit in r5443 > > added mutex_trylock() to backport hacks > > Signed-off-by: Or Gerlitz > > Index: mutex-backport.h > =================================================================== > --- mutex-backport.h (revision 5225) > +++ mutex-backport.h (revision 5443) > @@ -10,6 +10,9 @@ > #define mutex_init(foo) init_MUTEX(foo) > #define mutex_lock(foo) down(foo) > #define mutex_lock_interruptible(foo) down_interruptible(foo) > +/* this function follows the spin_trylock() convention, so * > + * it is negated to the down_trylock() return values! Be careful */ > +#define mutex_trylock(foo) !down_trylock(foo) > #define mutex_unlock(foo) up(foo) > > #endif /* __LINUX_MUTEX_H */ "negated to the down_trylock() return values" probably means !down_trylock which is already obvious. So what does the comment mean? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Sun Feb 19 04:56:07 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 19 Feb 2006 14:56:07 +0200 (IST) Subject: [openib-general] [PATCH] iser: convert semaphore to mutex Message-ID: please note that to user iser with 2.6.15 and below kernels one needs to copy mutex-backport.h from https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/include/linux to be under /usr/src/linux/include/linux Or. 
semaphore to mutex conversion Signed-off-by: Or Gerlitz Index: iscsi_iser.h =================================================================== --- iscsi_iser.h (revision 5442) +++ iscsi_iser.h (revision 5444) @@ -52,7 +52,14 @@ #include #include #include -#include + +/* XXX remove this compatibility hack when 2.6.16 is released */ +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) +#include +#else +#include +#endif /* XXX end of hack */ #include #include @@ -295,7 +302,7 @@ struct iscsi_iser_conn { struct kfifo *mgmtqueue; /* mgmt (control) xmit queue */ struct kfifo *xmitqueue; /* data-path cmd queue */ struct work_struct xmitwork; /* per-conn. xmit workqueue */ - struct semaphore xmitsema; /* serializes connection xmit, + struct mutex xmitmutex; /* serializes connection xmit, * access to kfifos: * * xmitqueue, * * immqueue, mgmtqueue */ @@ -401,7 +408,7 @@ struct iser_page_vec { }; struct iser_global { - struct semaphore adaptor_list_sem; /* */ + struct mutex adaptor_list_mutex;/* */ struct list_head adaptor_list; /* all iSER adaptors */ kmem_cache_t *desc_cache; Index: iser_verbs.c =================================================================== --- iser_verbs.c (revision 5442) +++ iser_verbs.c (revision 5444) @@ -240,7 +240,7 @@ struct iser_adaptor *iser_adaptor_find_b struct list_head *p_list; struct iser_adaptor *p_adaptor = NULL; - down(&ig.adaptor_list_sem); + mutex_lock(&ig.adaptor_list_mutex); p_list = ig.adaptor_list.next; while (p_list != &ig.adaptor_list) { @@ -267,14 +267,14 @@ struct iser_adaptor *iser_adaptor_find_b end: BUG_ON(p_adaptor == NULL); p_adaptor->refcount++; - up(&ig.adaptor_list_sem); + mutex_unlock(&ig.adaptor_list_mutex); return p_adaptor; } /* if there's no demand for this adaptor, release it */ static void iser_adaptor_try_release(struct iser_adaptor *p_adaptor) { - down(&ig.adaptor_list_sem); + mutex_lock(&ig.adaptor_list_mutex); p_adaptor->refcount--; iser_err("adaptor %p refcount %d\n",p_adaptor,p_adaptor->refcount); 
if (!p_adaptor->refcount) { @@ -282,7 +282,7 @@ static void iser_adaptor_try_release(str list_del(&p_adaptor->ig_list); kfree(p_adaptor); } - up(&ig.adaptor_list_sem); + mutex_unlock(&ig.adaptor_list_mutex); } /** Index: iscsi_iser.c =================================================================== --- iscsi_iser.c (revision 5442) +++ iscsi_iser.c (revision 5444) @@ -707,10 +707,10 @@ iscsi_iser_xmitworker(void *data) /* * serialize Xmit worker on a per-connection basis. */ - down(&conn->xmitsema); + mutex_lock(&conn->xmitmutex); if (iscsi_iser_data_xmit(conn)) schedule_work(&conn->xmitwork); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); } @@ -784,11 +784,11 @@ iscsi_iser_queuecommand(struct scsi_cmnd session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1); spin_unlock(&session->lock); - if (!in_interrupt() && !down_trylock(&conn->xmitsema)) { + if (!in_interrupt() && mutex_trylock(&conn->xmitmutex)) { spin_unlock_irq(host->host_lock); if (iscsi_iser_data_xmit(conn)) schedule_work(&conn->xmitwork); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); spin_lock_irq(host->host_lock); } else schedule_work(&conn->xmitwork); @@ -922,7 +922,7 @@ iscsi_iser_conn_create(iscsi_sessionh_t spin_unlock_bh(&session->lock); init_timer(&conn->tmabort_timer); - init_MUTEX(&conn->xmitsema); + mutex_init(&conn->xmitmutex); init_waitqueue_head(&conn->ehwait); spin_lock_init(&conn->lock); @@ -949,7 +949,7 @@ iscsi_iser_conn_destroy(iscsi_connh_t co debug_iser("%s: enter\n", __FUNCTION__); - down(&conn->xmitsema); + mutex_lock(&conn->xmitmutex); set_bit(SUSPEND_BIT, &conn->suspend_tx); if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE && conn->sock) { @@ -968,7 +968,7 @@ iscsi_iser_conn_destroy(iscsi_connh_t co } spin_unlock_bh(&session->lock); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); /* * Block until all in-progress commands for this connection @@ -1135,7 +1135,7 @@ iscsi_iser_conn_stop(iscsi_connh_t connh BUG_ON(!conn->sock); - down(&conn->xmitsema); 
+ mutex_lock(&conn->xmitmutex); spin_lock_irqsave(session->host->host_lock, flags); spin_lock(&session->lock); @@ -1189,7 +1189,7 @@ iscsi_iser_conn_stop(iscsi_connh_t connh sock_release(conn->sock); conn->sock = NULL; } - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); debug_iser("%s: exit\n", __FUNCTION__); } @@ -1353,13 +1353,13 @@ iscsi_iser_eh_abort(struct scsi_cmnd *sc * 1) connection-level failure; * 2) recovery due protocol error; */ - down(&conn->xmitsema); + mutex_lock(&conn->xmitmutex); spin_lock_bh(&session->lock); debug_iser("%s: session->state = %d\n", __FUNCTION__, session->state); if (session->state != ISCSI_STATE_LOGGED_IN) { if (session->state == ISCSI_STATE_TERMINATE) { spin_unlock_bh(&session->lock); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); debug_scsi("abort failed becuase session->state == ISCSI_STATE_TERMINATE\n"); goto failed; } @@ -1378,7 +1378,7 @@ iscsi_iser_eh_abort(struct scsi_cmnd *sc * 2) session was re-open during time out of ctask. */ spin_unlock_bh(&session->lock); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); goto success; } conn->tmabort_state = TMABORT_INITIAL; @@ -1430,7 +1430,7 @@ iscsi_iser_eh_abort(struct scsi_cmnd *sc conn->tmabort_state == TMABORT_SUCCESS) { conn->tmabort_state = TMABORT_INITIAL; spin_unlock_bh(&session->lock); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); goto success; } conn->tmabort_state = TMABORT_INITIAL; @@ -1440,7 +1440,7 @@ iscsi_iser_eh_abort(struct scsi_cmnd *sc spin_unlock_bh(&session->lock); } } - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); /* * block eh thread until: @@ -1517,7 +1517,7 @@ failed: exit: del_timer_sync(&conn->tmabort_timer); - down(&conn->xmitsema); + mutex_lock(&conn->xmitmutex); if (conn->sock) { struct sock *sk = conn->sock->sk; @@ -1525,7 +1525,7 @@ exit: iscsi_iser_ctask_cleanup(conn, ctask); write_unlock_bh(&sk->sk_callback_lock); } - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); debug_iser("%s: exit\n", 
__FUNCTION__); return rc; } @@ -1793,9 +1793,9 @@ iscsi_iser_conn_send_pdu(iscsi_connh_t c struct iscsi_iser_conn *conn = iscsi_ptr(connh); int rc; - down(&conn->xmitsema); + mutex_lock(&conn->xmitmutex); rc = iscsi_iser_conn_send_generic(connh, hdr, data, data_size); - up(&conn->xmitsema); + mutex_unlock(&conn->xmitmutex); return rc; } @@ -1845,7 +1845,7 @@ static int __init iser_init(void) return -ENOMEM; /* adaptor init is called only after the first addr resolution */ - init_MUTEX(&ig.adaptor_list_sem); + mutex_init(&ig.adaptor_list_mutex); INIT_LIST_HEAD(&ig.adaptor_list); err = iser_register_sockets(); From ogerlitz at voltaire.com Sun Feb 19 04:55:22 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 19 Feb 2006 14:55:22 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: <20060219124553.GA22037@mellanox.co.il> References: <20060219124553.GA22037@mellanox.co.il> Message-ID: <43F86ABA.5060209@voltaire.com> Michael S. Tsirkin wrote: > Quoting r. Or Gerlitz : >> +/* this function follows the spin_trylock() convention, so * >> + * it is negated to the down_trylock() return values! Be careful */ >> +#define mutex_trylock(foo) !down_trylock(foo) > "negated to the down_trylock() return values" probably means !down_trylock > which is already obvious. So what does the comment mean? the comment comes to warn people that a code that was using if(!down_trylock(foo)) need to be changed to if(mutex_trylock(foo)) and viseversa (code that was using if(down_trylock(foo)) etc) - that's all. Or. From mst at mellanox.co.il Sun Feb 19 05:06:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Feb 2006 15:06:37 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: <43F86ABA.5060209@voltaire.com> References: <20060219124553.GA22037@mellanox.co.il> <43F86ABA.5060209@voltaire.com> Message-ID: <20060219130637.GC22037@mellanox.co.il> Quoting r. 
Or Gerlitz : > >>+/* this function follows the spin_trylock() convention, so * > >>+ * it is negated to the down_trylock() return values! Be careful */ > >>+#define mutex_trylock(foo) !down_trylock(foo) > > >"negated to the down_trylock() return values" probably means !down_trylock > >which is already obvious. So what does the comment mean? > > the comment comes to warn people that a code that was using > if(!down_trylock(foo)) need to be changed to if(mutex_trylock(foo)) and > viseversa (code that was using if(down_trylock(foo)) etc) - that's all. Lets be explicit then, and give some actual information: /* NB: mutex_trylock returns 1 on success, down_trylock - 0 on success. */ Make sense? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Sun Feb 19 05:12:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Feb 2006 15:12:42 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <1140189358.4333.44536.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> Message-ID: <20060219131242.GE22037@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: RE: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter > > On Thu, 2006-02-16 at 14:53, Dror Goldenberg wrote: > > > From: openib-general-bounces at openib.org > > > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > > > Sent: Thursday, February 16, 2006 1:13 PM > > > > > > On Thu, 2006-02-16 at 02:54, Michael S. Tsirkin wrote: > > > > Quoting r. 
Hal Rosenstock : > > > > > Subject: Re: Re: [PATCH] change Mellanox SDP workaround to a > > > > > moduleparameter > > > > > > > > > > On Wed, 2006-02-15 at 19:03, Roland Dreier wrote: > > > > > > > > > > > I guess the question is what to do when a Tavor (with the > > > > > > performance bug that makes a 1K MTU faster) connects to someone > > > > > > else. > > > > > > > > > > Isn't it the other way 'round (when something with a larger MTU > > > > > connects to Tavor) ? > > > > > > > > Right. I wish we had an MTU field in the REP packet, but we dont. > > > > > > Yes, that would be better IMO too. Not sure why it wasn't > > > done that way. Guess you could file an erratum on this. > > > > > > -- Hal > > > > The SWG defined a generic mechanism which uses REJ to indicate that the > > passive side does not accept a certain REQ fields, and allows the passive > > side to indicate an alternative value. Indirection is also supported through > > the same protocol. It also allows the active side, following the REJ, to use > > an alternate value, other than the one suggested by the passive side, i.e. > > passive side only has a veto capability. This is the mechanism and the short > > theory behind it. Unfortunately it's a bit inefficient in terms of > > performance because of the ping pong of messages. Solving just the MTU might > > not be a good enough argument. The approach should be to enable the active > > side to specify a set of acceptable parameters for each one of the REQ > > fields, and then let the passive side to choose. This may change the CM > > packets all over and will introduce new problems. I don't think that there's > > a good chance of just adding a solution for just one of the fields. Anyway, > > you can still try and propose this to IBTA, I tried it once already :) > > Thanks for the historical perspective. It's harder to overturn an > existing vote on something at the IBTA. Not sure I have the time to take > up this (larger) mission. 
Assuming the spec says as it is, then: 1. CMA needs to be modified to retry the connection if its rejected because of lower MTU. 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec says the connection should be closed when we get a REJ. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Sun Feb 19 05:29:06 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 19 Feb 2006 15:29:06 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: <20060219130637.GC22037@mellanox.co.il> References: <20060219124553.GA22037@mellanox.co.il> <43F86ABA.5060209@voltaire.com> <20060219130637.GC22037@mellanox.co.il> Message-ID: <43F872A2.3090400@voltaire.com> Michael S. Tsirkin wrote: > Lets be explicit then, and give some actual information: > /* NB: mutex_trylock returns 1 on success, down_trylock - 0 on success. */ > Make sense?
It does make sense, anyway I just brought a quote from the code of 2.6.16/kernel/mutex.c so I don't see any reason to change it. Or. From mst at mellanox.co.il Sun Feb 19 05:31:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Feb 2006 15:31:56 +0200 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: Message-ID: <20060219133156.GF22037@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Plans for libibverbs 1.0, 1.1 and beyond > > Christoph> we also currently prefer svn for the 1.0 release. At > Christoph> some point we'll have to backport bugfixes to the 1.0 > Christoph> release found in a later development version. Having > Christoph> to do that at all isn't really fun, but having to keep > Christoph> some of that code in "sort of" sync between different > Christoph> repositories is even more difficult. > > There seems to be some confusion here. There would only ever be one > libibverbs repository. The only question is whether it remains in svn > or moves to a different SCM. > > In fact moving to git would make porting patches between different > branches far easier. I think the claim is that it's awkward to use different tools to get different components. Related question: can git support working on just the infiniband subdirectory of a tree? That's important, e.g., for people that only want to work on the infiniband stuff against the last stable kernel. -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Sun Feb 19 05:39:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Feb 2006 15:39:15 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: <43F872A2.3090400@voltaire.com> References: <20060219124553.GA22037@mellanox.co.il> <43F86ABA.5060209@voltaire.com> <20060219130637.GC22037@mellanox.co.il> <43F872A2.3090400@voltaire.com> Message-ID: <20060219133915.GG22037@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: Re: [PATCH] mutex-backport: add mutex_trylock support > > Michael S.
Tsirkin wrote: > >Lets be explicit then, and give some actual information: > >/* NB: mutex_trylock returns 1 on success, down_trylock - 0 on success. */ > >Make sense? > > it does make sense, anyway I just brought a quote from the code of > 2.6.16/kernel/mutex.c so i dont see any reason to change it. Please copy verbatim then: * NOTE: this function follows the spin_trylock() convention, so * it is negated to the down_trylock() return values! Be careful * about this when converting semaphore users to mutexes. Both NOTE and the last line are missing. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies Return-Path: Delivered-To: openib-general at openib.org Received: from mx2.netapp.com (mx2.netapp.com [216.240.18.37]) by openib.ca.sandia.gov (Postfix) with ESMTP id 886582283D8 for ; Sun, 19 Feb 2006 07:03:37 -0800 (PST) Received: from smtp1.corp.netapp.com ([10.57.156.124]) by mx2.netapp.com with ESMTP; 19 Feb 2006 06:56:32 -0800 X-IronPort-AV: i="4.02,127,1139212800"; d="scan'208"; a="359977071:sNHT25492488" Received: from svlexc03.hq.netapp.com (svlexc03.corp.netapp.com [10.57.156.149]) by smtp1.corp.netapp.com (8.13.1/8.13.1/NTAP-1.6) with ESMTP id k1JEuUMq015428; Sun, 19 Feb 2006 06:56:30 -0800 (PST) Received: from lavender.hq.netapp.com ([10.56.11.75]) by svlexc03.hq.netapp.com with Microsoft SMTPSVC(6.0.3790.0); Sun, 19 Feb 2006 06:56:30 -0800 Received: from exnane01.hq.netapp.com ([10.97.0.61]) by lavender.hq.netapp.com with Microsoft SMTPSVC(5.0.2195.6713); Sun, 19 Feb 2006 06:56:30 -0800 Received: from tmt.netapp.com ([10.30.32.67]) by exnane01.hq.netapp.com with Microsoft SMTPSVC(6.0.3790.0); Sun, 19 Feb 2006 09:56:28 -0500 Message-Id: <7.0.1.0.2.20060219093506.040ee028 at netapp.com> X-Mailer: QUALCOMM Windows Eudora Version 7.0.1.0 Date: Sun, 19 Feb 2006 09:56:18 -0500 To: Christoph Hellwig From: "Talpey, Thomas" Subject: Re: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <20060219120155.GA10268 at lst.de> References: 
<7.0.1.0.2.20060207211754.0409e8a0 at netapp.com> <20060219120155.GA10268 at lst.de> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-OriginalArrivalTime: 19 Feb 2006 14:56:28.0782 (UTC) FILETIME=[A9C3C8E0:01C63564] Cc: nfsv4 at linux-nfs.org, openib-general at openib.org X-BeenThere: openib-general at openib.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: OpenIB General Mailing List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 19 Feb 2006 15:03:40 -0000 Thanks for the detailed review! Some replies below. I left the IETF list out of this reply since it's basically porting, not protocol. At 07:01 AM 2/19/2006, Christoph Hellwig wrote: >On Wed, Feb 08, 2006 at 03:58:56PM -0500, Talpey, Thomas wrote: >> We have released an updated NFS/RDMA client for Linux at >> the project's Sourceforge site: > >Thanks, this looks much better than the previous patch. > >Comments: > > - please don't build the rdma transport unconditional, but make it > a user-visible config option It's an option, but it's located in fs/Kconfig not net/. This is the way SUNRPC is selected, so we simply followed that. BTW, Chuck's transport switch doesn't support dynamically loading modules yet so there is a dependency to work out until that's in place. > - please use the kernel u*/s* types instead of (u)int*_t We use uint*_t for the user-visible protocol definitions (on the wire) and u32 etc for kernel stuff. I'll recheck if we got something wrong. > - please include your local headers after the headers, There are a couple of issues with header include ordering that seem to change pretty often. In a couple of cases we had to rearrange things to avoid forward declarations, I'll recheck this. > and keep all the includes at the beginning of the files, just after > the licence comment block > - chunktype shouldn't be a typedef but a pure enum, and the > names look a bit too generic, please add an rdma_ prefix Ok on both. 
> - please kill the XDR_TARGET and pos0 macros, maybe RPC_SEND_SEG0 > and RPC_SEND_LEN0, too > - RPC_SEND_VECS should become an inline functions and be spelled > lowercase > - RPC_SEND_COPY is probably too large to be inlined and should be > spelled lowercase > - RPC_RECV_VECS should be an inline and spelled lowercase > - RPC_RECV_SEG0 and PC_RECV_LEN0 should probably go away. > - RPC_RECV_COPY is probably too large to be inlined and should be > spelled lowercase > - RPC_RECV_COPY same comment about highmem and kmap as in > RPC_SEND_COPY These are killable. They were there to support code sharing for 2.4 kernels and are easy to eliminate now. > - the CONFIG_HIGHMEM ifdef block in RPC_SEND_COPY is wrong. Please > always use kmap, it does the right thing for non-highmem aswell. > The PageHighMem check and using kmap_high directly is always > wrong, they are internal implementation details. I'd also suggest > evaluating kmap_atomic because it scales much better on SMP systems. Yes, there are some issues here which we're still working out. In fact, we can't use kunmap() in the context you mention because in 2.6.14 (or is it .15) it started to check for being invoked in interrupt context. There is one configuration in which we do call it in bh context. The call won't block but the kernel BUG_ON's. This is something on our list to address. > - please try to avoid file-scope forward-prototypes but try to order the > code in the natural flow where they aren't required Good point. Will recheck for these. > - structures like rpcrdma_msg that are on the wire should use __be* > for endianess annotations, and the cpu_to_be*/be*_to_cpu accessor > functions instead of hton?/ntoh?. Please verify that these annotations > are correct using sparse -D__CHECK_ENDIAN__=1 Hmm, okay but existing RPC and NFS code don't do this. I'm reluctant to differ from the style of the governing subsystem. I'll check w/Trond. > - rdma_convert_physiov/rdma_convert_phys are completely broken. 
> page_to_phys can't be used by driver/fs code. RDMA only deals with bus > addresses, not physical addresses. You must use the dma mapping API > instead. Also coalescing decisions are made by the dma layer, because > they are platform dependent and much more complex then what the code > in this patch does. Now that we are moving to OpenIB api's this is needed. There is some thought necessary w.r.t. our max-performance mode of preregistering memory in DMA mode. That's on our list of course. > - transport.c is missing a GPL license statement Oops. > - in transport.c please don't use CamelCase variable names. This is just for module parameters? These are going away but we don't have the new NFS mount API yet. There is a comment to that effect but maybe it doesn't mention the module stuff. > - MODULE_PARM shouldn't be used in new code, but module_param instead. Ditto. > - please don't use the (void) function() style, it just obsfucates the > code without benefit. Ok. > - try_module_get(THIS_MODULE) is always wrong. Reference counting > should happen from the calling module. This is the same convention used by the other RPC transports. I will pass the comment along. > - please initialize global or file-scope spinlocks with > DEFINE_SPINLOCK(). Ok. > - the traditional name for the second argument to spin_lock_irqsave is > just flags, not lock_flags. This doesn't really matter, but > following such conventions makes it easier to understand the code > for kernel hackers that just occasionally drop into your code. Ok. > - no need to case the return value from kmalloc/kzalloc/etc. They > return void * which can be directly assigned to any pointer type. Ok. Did *I* write that code? Very unlike me. :-) > - please avoid typedes for structure types, like struct rdma_ia, > struct rdma_ep, etc.. I thought we got rid of those. Will recheck. Tom. 
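As an aside on the endianness item in the review above (annotating on-the-wire structures and converting with cpu_to_be*/be*_to_cpu rather than hton?/ntoh?), the idea can be sketched in plain userspace C. This is only an illustration under stated assumptions: every sketch_ name, the be32_t typedef, and the sketch_wire_hdr layout are hypothetical stand-ins, not the kernel's __be32 type or the actual rpcrdma_msg definition, and a plain typedef gives none of the checking that sparse performs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical marker type: a 32-bit value stored in big-endian (wire) order.
 * In the kernel this role is played by __be32, which sparse can check. */
typedef uint32_t be32_t;

/* Hypothetical stand-in for cpu_to_be32(): host order -> wire order.
 * Built from explicit shifts so it behaves the same on any host endianness. */
static be32_t sketch_cpu_to_be32(uint32_t host)
{
    unsigned char b[4];
    be32_t wire;

    b[0] = (unsigned char)(host >> 24);
    b[1] = (unsigned char)(host >> 16);
    b[2] = (unsigned char)(host >> 8);
    b[3] = (unsigned char)host;
    memcpy(&wire, b, sizeof(wire));
    return wire;
}

/* Hypothetical stand-in for be32_to_cpu(): wire order -> host order. */
static uint32_t sketch_be32_to_cpu(be32_t wire)
{
    unsigned char b[4];

    memcpy(b, &wire, sizeof(b));
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Hypothetical on-the-wire header; the field names are illustrative only. */
struct sketch_wire_hdr {
    be32_t xid;   /* always held in wire order */
    be32_t vers;  /* always held in wire order */
};

/* Fill a header from host-order values; conversion happens in one place. */
static void sketch_hdr_encode(struct sketch_wire_hdr *hdr,
                              uint32_t xid, uint32_t vers)
{
    hdr->xid = sketch_cpu_to_be32(xid);
    hdr->vers = sketch_cpu_to_be32(vers);
}
```

The point of the separate be32_t name is that a field's byte order is visible in its declaration, so a missing conversion stands out in review; in kernel code the same intent is expressed with the real annotated types and verified by sparse.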
From hch at lst.de Sun Feb 19 08:14:55 2006 From: hch at lst.de (Christoph Hellwig) Date: Sun, 19 Feb 2006 17:14:55 +0100 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <7.0.1.0.2.20060219093506.040ee028@netapp.com> References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> Message-ID: <20060219161455.GA14037@lst.de> On Sun, Feb 19, 2006 at 09:56:18AM -0500, Talpey, Thomas wrote: > > - please don't build the rdma transport unconditional, but make it > > a user-visible config option > > It's an option, but it's located in fs/Kconfig not net/. This is the way > SUNRPC is selected, so we simply followed that. BTW, Chuck's transport > switch doesn't support dynamically loading modules yet so there is a > dependency to work out until that's in place. Right now it's an option, but not a user-selectable one: --- snip --- +config SUNRPC_XPRT_RDMA + depends on INFINIBAND + tristate --- snip --- to make it user-visible you need to add an option description after the tristate, e.g. tristate "RDMA transport for sunrpc" not strictly required but very useful is an additional help text using the help verb of the kconfig language. In the end form the select on config SUNRPC shouldn't be there either. > > - the CONFIG_HIGHMEM ifdef block in RPC_SEND_COPY is wrong. Please > > always use kmap, it does the right thing for non-highmem aswell. > > The PageHighMem check and using kmap_high directly is always > > wrong, they are internal implementation details. I'd also suggest > > evaluating kmap_atomic because it scales much better on SMP systems. > > Yes, there are some issues here which we're still working out. In fact, we > can't use kunmap() in the context you mention because in 2.6.14 (or is it > .15) it started to check for being invoked in interrupt context. There is > one configuration in which we do call it in bh context.
The call won't block > but the kernel BUG_ON's. This is something on our list to address. That's one more reason to use kmap_atomic/kunmap_atomic, which is fine from interrupt context. You'll have to carefully check whether you can use an existing KM_ slot or allocate a new one, though. > > - structures like rpcrdma_msg that are on the wire should use __be* > > for endianess annotations, and the cpu_to_be*/be*_to_cpu accessor > > functions instead of hton?/ntoh?. Please verify that these annotations > > are correct using sparse -D__CHECK_ENDIAN__=1 > > Hmm, okay but existing RPC and NFS code don't do this. I'm reluctant to > differ from the style of the governing subsystem. I'll check w/Trond. The nfs code is in the process of being converted currently. > > - in transport.c please don't use CamelCase variable names. > > This is just for module parameters? These are going away but we don't have > the new NFS mount API yet. There is a comment to that effect but maybe > it doesn't mention the module stuff. It's for all variables, but afaik you only use mixed case for the module parameters. From dotanb at mellanox.co.il Sun Feb 19 08:26:40 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 19 Feb 2006 18:26:40 +0200 Subject: [openib-general] ibv_create_srq doesn't update the SRQ init attributes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4D5E@mtlexch01.mtl.com> Hi. When one creates a QP, the init attributes structure is changed by the verb to the actual init attributes that were used for this QP. When one creates an SRQ, the init attributes structure is not changed, BUT a query_srq will return the actual values that were used to create the SRQ (those values may be different from the values that the user put in the create_srq). Should the create_srq return the values that were used to create the SRQ in the init_attributes? thanks Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O.
Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Sun Feb 19 08:38:43 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Feb 2006 18:38:43 +0200 Subject: [openib-general] [RFC] [PATCH] OpenSM: Add functional partition manager support Message-ID: <20060219163843.GB16012@sashak.voltaire.com> Hello, Here is phase 1 of the partition manager for OpenSM. Please review. Thanks, Sasha. This patch implements partition management for OpenSM (Phase 1) as described in osm/doc/OpenSM_PKey_Mgr.txt. Basically at each heavy resweep this will: - recreate partition configuration - update pkey tables for endports - update switch's ports connected to endports - for partitions marked for IPoIB support this will also create appropriate multicast group Signed-off-by: Sasha Khapyorsky diff --git a/osm/doc/partition-config.txt b/osm/doc/partition-config.txt new file mode 100644 index 0000000..b3ba804 --- /dev/null +++ b/osm/doc/partition-config.txt @@ -0,0 +1,90 @@ +OpenSM Partitions configuration. +=============================== + +The default name of OpenSM partitions configuration file is +'/etc/osm-partitions.txt'. The default may be changed by using +--Pconfig (-P) option with OpenSM. + +The default partition will be created by OpenSM unconditionally even +when partition configuration file does not exist or cannot be accessed. + +The default partition has P_Key value 0x7fff. OpenSM's port will have +full membership in default partition. all other end ports will have +partial membership. + + +File Format. +=========== + +Comments: +-------- + +Line content followed after '#' character is comment and ignored by +parser. + + +General file format: +------------------- + +: ; + + +Partition Definition: +-------------------- + +[PartitionName][=PKey][,flag] + +PartitionName - free string, will be used with logging.
When omitted + empty string will be used. +PKey - P_Key value for this partition. Only low 15 bits will + be used. When omitted will be autogenerated. +flag - used to indicate IPoIB capability of this partition. + 'ipoib' is only valid value currently (in future other + values may be added). + + +PortGUIDs list: +-------------- + +[PortGUID[=full|=part]] [,PortGUID[=full|=part]] [,PortGUID] ... + +PortGUID - GUID of partition member EndPort. Hexadecimal numbers + should start from 0x. +full or part - indicates full or partial membership for this port. When + omitted (or unrecognized) partial membership is assumed. + +There are two useful keywords for PortGUID definition: + +- 'ALL' means all end ports in this subnet +- 'SELF' means subnet manager's port. + +Empty list means no ports in this partition. + + +Notes: +----- + +White spaces are permitted between delimiters ('=', ',',':',';'). + +The Line can be wrapped after ':' followed after Partition Definition and +between. + +PartitionName does not need to be unique, PKey does need to be unique. +If PKey is repeated then those partition configurations will be merged +(see also next note). + +It is possible to split partition configuration in more than one +definition, but then PKey should be explicitly specified (overwise +different PKey values will be generated for those definitions). 
+ + +Examples: +-------- + +Default=0x7fff : ALL, SELF=full ; + +NewPartition , ipoib : 0x123456=full, 0x3456789034=part, 0x2134af2306 ; + +YetAnotherOne = 0x300 : SELF=full ; +YetAnotherOne = 0x300 : ALL=part ; + diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h index 660771f..3da39a6 100644 --- a/osm/include/opensm/osm_base.h +++ b/osm/include/opensm/osm_base.h @@ -222,6 +222,22 @@ BEGIN_C_DECLS #endif /***********/ +/****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE +* NAME +* OSM_DEFAULT_PARTITION_CONFIG_FILE +* +* DESCRIPTION +* Specifies the default partition config file name +* +* SYNOPSIS +*/ +#ifdef __WIN__ +#define OSM_DEFAULT_PARTITION_CONFIG_FILE strcat(GetOsmPath(), "osm-partitions.conf") +#else +#define OSM_DEFAULT_PARTITION_CONFIG_FILE "/etc/osm-partitions.conf" +#endif +/***********/ + /****d* OpenSM: Base/OSM_DEFAULT_SWEEP_INTERVAL_SECS * NAME * OSM_DEFAULT_SWEEP_INTERVAL_SECS diff --git a/osm/include/opensm/osm_partition.h b/osm/include/opensm/osm_partition.h index 27678c2..369cf8a 100644 --- a/osm/include/opensm/osm_partition.h +++ b/osm/include/opensm/osm_partition.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -50,6 +50,12 @@ #ifndef _OSM_PARTITION_H_ #define _OSM_PARTITION_H_ +#include +#include +#include +#include +#include + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -94,12 +100,17 @@ BEGIN_C_DECLS */ typedef struct _osm_prtn { - uint16_t pkey; - cl_map port_guid_tbl; - + cl_map_item_t map_item; + uint16_t pkey; + cl_map_t full_guid_tbl; + cl_map_t part_guid_tbl; + char name[32]; } osm_prtn_t; /* * FIELDS +* map_item +* Linkage structure for cl_qmap. MUST BE FIRST MEMBER! +* * pkey * The IBA defined P_KEY of this Partition. 
* @@ -111,118 +122,61 @@ typedef struct _osm_prtn * Partition *********/ -/****f* OpenSM: Partition/osm_prtn_construct +/****f* OpenSM: Partition/osm_prtn_delete * NAME -* osm_prtn_construct +* osm_prtn_delete * * DESCRIPTION -* This function constructs a Partition. +* This function destroys and deallocates a Partition object. * * SYNOPSIS */ -void osm_prtn_construct( - IN osm_prtn_t* const p_prtn ); +void osm_prtn_delete( + IN OUT osm_prtn_t** const pp_prtn ); /* * PARAMETERS -* p_prtn -* [in] Pointer to a Partition to construct. +* pp_prtn +* [in][out] Pointer to a pointer to a Partition oject to +* delete. On return, this pointer is NULL. * * RETURN VALUE * This function does not return a value. * * NOTES -* Allows calling osm_prtn_init, osm_prtn_destroy, and osm_prtn_is_inited. -* -* Calling osm_prtn_construct is a prerequisite to calling any other -* method except osm_prtn_init. +* Performs any necessary cleanup of the specified Partition object. * * SEE ALSO -* Partition, osm_prtn_init, osm_prtn_destroy, osm_prtn_is_inited +* Partition, osm_prtn_new *********/ -/****f* OpenSM: Partition/osm_prtn_destroy +/****f* OpenSM: Partition/osm_prtn_new * NAME -* osm_prtn_destroy +* osm_prtn_new * * DESCRIPTION -* The osm_prtn_destroy function destroys a Partition, releasing -* all resources. +* This function allocates and initializes a Partition object. * * SYNOPSIS */ -void osm_prtn_destroy( - IN osm_prtn_t* const p_prtn ); +osm_prtn_t* osm_prtn_new( + IN const char *name, + IN const uint16_t pkey ); /* * PARAMETERS -* p_prtn -* [in] Pointer to a Partition to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Partition. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_prtn_construct or -* osm_prtn_init. 
-* -* SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_init -*********/ - -/****f* OpenSM: Partition/osm_prtn_init -* NAME -* osm_prtn_init -* -* DESCRIPTION -* The osm_prtn_init function initializes a Partition for use. -* -* SYNOPSIS -*/ -ib_api_status_t osm_prtn_init( - IN osm_prtn_t* const p_prtn ); -/* -* PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object to initialize. -* -* RETURN VALUES -* CL_SUCCESS if initialization was successful. -* -* NOTES -* Allows calling other Partition methods. +* name +* [in] Partition name string * -* SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_destroy, -* osm_prtn_is_inited -*********/ - -/****f* OpenSM: Partition/osm_prtn_is_inited -* NAME -* osm_prtn_is_inited -* -* DESCRIPTION -* Indicates if the object has been initialized with osm_prtn_init. -* -* SYNOPSIS -*/ -boolean_t osm_ptrn_is_inited( - IN const osm_prtn_t* const p_prtn ); -/* -* PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object. +* pkey +* [in] Partition P_Key value * -* RETURN VALUES -* TRUE if the object was initialized successfully, -* FALSE otherwise. +* RETURN VALUE +* Pointer to the initialize Partition object. * * NOTES -* The osm_prtn_construct or osm_prtn_init must be called before using -* this function. +* Allows calling other partition methods. 
* * SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_init +* Partition *********/ /****f* OpenSM: Partition/osm_prtn_is_guid @@ -234,9 +188,14 @@ boolean_t osm_ptrn_is_inited( * * SYNOPSIS */ +static inline boolean_t osm_prtn_is_guid( IN const osm_prtn_t* const p_prtn, - IN const uint64 guid ); + IN const ib_net64_t guid ) +{ + return (cl_map_get(&p_prtn->full_guid_tbl, guid) != NULL) || + (cl_map_get(&p_prtn->part_guid_tbl, guid) != NULL); +} /* * PARAMETERS * p_prtn @@ -254,24 +213,28 @@ boolean_t osm_prtn_is_guid( * SEE ALSO *********/ -/****f* OpenSM: Partition/osm_prtn_get_pkey +/****f* OpenSM: Partition/osm_prtn_make_partitions * NAME -* osm_prtn_get_pkey +* osm_prtn_make_partitions * * DESCRIPTION -* Gets the IBA defined P_KEY value for this Partition. +* Makes all partitions in subnet. * * SYNOPSIS */ -uint16_t osm_prtn_get_pkey( - IN const osm_prtn_t* const p_prtn ); +ib_api_status_t osm_prtn_make_partitions( + IN osm_log_t * const p_log, + IN osm_subn_t * const p_subn); /* * PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object. +* p_log +* [in] Pointer to a log object. +* +* p_subn +* [in] Pointer to subnet object. * * RETURN VALUES -* P_KEY value for this Partition. +* IB_SUCCESS value on success. * * NOTES * diff --git a/osm/include/opensm/osm_sa_mcmember_record.h b/osm/include/opensm/osm_sa_mcmember_record.h index 97c6296..6e4e033 100644 --- a/osm/include/opensm/osm_sa_mcmember_record.h +++ b/osm/include/opensm/osm_sa_mcmember_record.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -315,6 +315,42 @@ osm_mcmr_rcv_create_new_mgrp( * *********/ +/****f* OpenSM: MC Member record Receiver/osm_mcmr_rcv_find_or_create_new_mgrp +* NAME +* osm_mcmr_rcv_find_or_create_new_mgrp +* +* DESCRIPTION +* Create new Multicast group +* +* SYNOPSIS +*/ + +ib_api_status_t +osm_mcmr_rcv_find_or_create_new_mgrp( + IN osm_mcmr_recv_t* const p_mcmr, + IN uint64_t comp_mask, + IN ib_member_rec_t* const p_recvd_mcmember_rec, + OUT osm_mgrp_t **pp_mgrp); +/* +* PARAMETERS +* p_mcmr +* [in] Pointer to an osm_mcmr_recv_t object. +* p_recvd_mcmember_rec +* [in] Received Multicast member record +* +* pp_mgrp +* [out] pointer the osm_mgrp_t object +* +* RETURN VALUES +* IB_SUCCESS, IB_ERROR +* +* NOTES +* +* +* SEE ALSO +* +*********/ + #define JOIN_MC_COMP_MASK (IB_MCR_COMPMASK_MGID | \ IB_MCR_COMPMASK_PORT_GID | \ IB_MCR_COMPMASK_JOIN_STATE) diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 0aab874..7841c29 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -73,6 +73,8 @@ BEGIN_C_DECLS #define OSM_SUBNET_VECTOR_CAPACITY 256 +struct _osm_opensm_t; + /****h* OpenSM/Subnet * NAME * Subnet @@ -220,6 +222,7 @@ typedef struct _osm_subn_opt uint8_t log_flags; char * dump_files_dir; char * log_file; + const char * partition_config_file; boolean_t accum_log_file; boolean_t console; cl_map_t port_prof_ignore_guids; @@ -399,6 +402,7 @@ typedef struct _osm_subn_opt */ typedef struct _osm_subn { + struct _osm_opensm_t *p_osm; cl_qmap_t sw_guid_tbl; cl_qmap_t node_guid_tbl; cl_qmap_t port_guid_tbl; @@ -644,6 +648,7 @@ osm_subn_destroy( ib_api_status_t osm_subn_init( IN osm_subn_t* const p_subn, + IN struct _osm_opensm_t * const p_osm, IN const osm_subn_opt_t* const p_opt ); /* * PARAMETERS diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 447176a..ea7b007 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -80,6 +80,7 @@ opensm_SOURCES = main.c osm_console.c os osm_state_mgr_ctrl.c osm_subnet.c \ osm_sweep_fail_ctrl.c osm_sw_info_rcv.c \ osm_sw_info_rcv_ctrl.c osm_switch.c \ + osm_prtn.c osm_prtn_config.c \ osm_trap_rcv.c osm_trap_rcv_ctrl.c \ osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ diff --git a/osm/opensm/main.c b/osm/opensm/main.c index c5ba443..ed6ed79 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -182,6 +182,10 @@ show_usage(void) " This option will cause deletion of the log file\n" " (if it previously exists). 
By default, the log file\n" " is accumulative.\n\n"); + printf( "-P\n" + "--Pconfig\n" + " This option defines the optional partition configurationi file.\n" + " The default name is \'" OSM_DEFAULT_PARTITION_CONFIG_FILE "\'.\n\n"); printf( "-y\n" "--stay_on_fatal\n" " This option will cause SM not to exit on fatal initialization\n" @@ -470,7 +474,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy"; + const char * const short_option = "i:f:ed:g:l:s:t:a:P:uvVhorcy"; /* In the array below, the 2nd parameter specified the number @@ -491,6 +495,7 @@ main( { "D", 1, NULL, 'D'}, { "log_file", 1, NULL, 'f'}, { "erase_log_file",0, NULL, 'e'}, + { "Pconfig", 1, NULL, 'P'}, { "maxsmps", 1, NULL, 'n'}, { "console", 0, NULL, 'q'}, { "V", 0, NULL, 'V'}, @@ -680,6 +685,10 @@ main( printf(" Creating new log file\n"); break; + case 'P': + opt.partition_config_file = optarg; + break; + case 'y': opt.exit_on_fatal = FALSE; printf(" Staying on fatal initialization errors\n"); diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 6ca6796..94a16ac 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -262,7 +262,7 @@ osm_opensm_init( if( status != IB_SUCCESS ) goto Exit; - status = osm_subn_init( &p_osm->subn, p_opt ); + status = osm_subn_init( &p_osm->subn, p_osm, p_opt ); if( status != IB_SUCCESS ) goto Exit; diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 14ed2db..cc02bac 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -56,6 +56,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -107,121 +108,254 @@ osm_pkey_mgr_init( /********************************************************************** **********************************************************************/ 
-boolean_t +static ib_api_status_t +osm_pkey_mgr_update_pkey_entry( + IN const osm_pkey_mgr_t * const p_mgr, + IN const osm_physp_t *p_physp, + IN const ib_pkey_table_t *block, + IN const uint16_t block_index) +{ + osm_madw_context_t context; + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + uint32_t attr_mod; + + context.pkey_context.node_guid = osm_node_get_node_guid(p_node); + context.pkey_context.port_guid = osm_physp_get_port_guid(p_physp); + context.pkey_context.set_method = TRUE; + attr_mod = block_index; + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) + attr_mod |= osm_physp_get_port_num(p_physp) << 16; + return osm_req_set(p_mgr->p_req, osm_physp_get_dr_path_ptr(p_physp), + ( uint8_t * ) block, sizeof( *block ), + IB_MAD_ATTR_P_KEY_TABLE, + cl_hton32( attr_mod ), + CL_DISP_MSGID_NONE, &context ); +} + +/********************************************************************** + **********************************************************************/ + +/* + * Send a new entry for the pkey table for this port when this pkey + * does not exist. Update existed entry when membership was changed. 
+ */ + +static boolean_t __osm_pkey_mgr_process_physical_port( IN const osm_pkey_mgr_t * const p_mgr, - IN osm_node_t * p_node, - IN uint8_t port_num, + IN const ib_net16_t pkey, IN osm_physp_t * p_physp ) { - boolean_t return_val = FALSE; /* TRUE if IB_DEFAULT_PKEY was inserted */ - osm_madw_context_t context; + boolean_t return_val = FALSE; /* TRUE if pkey was inserted or updated */ + ib_api_status_t status; + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); ib_pkey_table_t *block = NULL; uint16_t block_index; uint16_t num_of_blocks; const osm_pkey_tbl_t *p_pkey_tbl; - uint32_t attr_mod; + ib_net16_t *p_orig_pkey; uint32_t i; - ib_net16_t pkey; - ib_api_status_t status; - boolean_t block_with_empty_entry_found; + boolean_t block_found = FALSE; OSM_LOG_ENTER( p_mgr->p_log, __osm_pkey_mgr_process_physical_port ); - /* - * Send a new entry for the pkey table for this node that includes - * IB_DEFAULT_PKEY when IB_DEFAULT_PARTIAL_PKEY or IB_DEFAULT_PKEY - * don't exist - */ - if ( ( osm_physp_has_pkey( p_mgr->p_log, - IB_DEFAULT_PKEY, - p_physp ) == FALSE ) && - ( osm_physp_has_pkey( p_mgr->p_log, - IB_DEFAULT_PARTIAL_PKEY, p_physp ) == FALSE ) ) - { - context.pkey_context.node_guid = osm_node_get_node_guid( p_node ); - context.pkey_context.port_guid = osm_physp_get_port_guid( p_physp ); - context.pkey_context.set_method = TRUE; - - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - block_with_empty_entry_found = FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl(p_physp); + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_pkey_tbl); + p_orig_pkey = cl_map_get(&p_pkey_tbl->keys, ib_pkey_get_base(pkey)); + + if ( p_orig_pkey && *p_orig_pkey == pkey ) { + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "No need to insert %04x for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16(pkey), + 
cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + } + goto _done; + } + else if (!p_orig_pkey) + { for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) { - pkey = block->pkey_entry[i]; - if ( ib_pkey_is_invalid( pkey ) ) + if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) { - block->pkey_entry[i] = IB_DEFAULT_PKEY; - block_with_empty_entry_found = TRUE; + block->pkey_entry[i] = pkey; + block_found = TRUE; break; } } - if ( block_with_empty_entry_found ) + if ( block_found ) { break; } } - - if ( block_with_empty_entry_found == FALSE ) + } + else + { + *p_orig_pkey = pkey; + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "__osm_pkey_mgr_process_physical_port: ERR 0501: " - "No empty entry was found to insert IB_DEFAULT_PKEY for node " - "0x%016" PRIx64 " and port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + i = p_orig_pkey - block->pkey_entry; + if ( i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK ) { + block_found = TRUE; + break; + } } - else - { - /* Building the attribute modifier */ - if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) - { - /* Port num | Block Index */ - attr_mod = port_num << 16 | block_index; - } - else - { - attr_mod = block_index; - } + } - status = osm_req_set( p_mgr->p_req, - osm_physp_get_dr_path_ptr( p_physp ), - ( uint8_t * ) block, - sizeof( *block ), - IB_MAD_ATTR_P_KEY_TABLE, - cl_hton32( attr_mod ), - CL_DISP_MSGID_NONE, &context ); - return_val = TRUE; /*IB_DEFAULT_PKEY was inserted */ + if ( block_found == FALSE ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_pkey_mgr_process_physical_port: ERR 0501: " + "No empty entry was found to insert %04x for node " + "0x%016" PRIx64 " and port %u\n", + cl_ntoh16(pkey), + 
cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + goto _done; + } - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " - "IB_DEFAULT_PKEY was inserted for node 0x%016" PRIx64 + status = osm_pkey_mgr_update_pkey_entry(p_mgr, p_physp, block, block_index); + + if (status != IB_SUCCESS) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_pkey_mgr_process_physical_port: " + "osm_pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " and port %u\n", + block_index, + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + goto _done; + } + + return_val = TRUE; /* pkey was inserted/updated */ + + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "%04x was inserted for node 0x%016" PRIx64 + " and port %u\n", + cl_ntoh16(pkey), + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + } + + _done: + OSM_LOG_EXIT( p_mgr->p_log ); + return ( return_val ); +} + + +/********************************************************************** + **********************************************************************/ +static void osm_pkey_mgr_update_peer_port( + const osm_pkey_mgr_t * const p_mgr, + const osm_port_t * const p_port) +{ + osm_physp_t *p, *peer; + osm_node_t *p_node; + ib_pkey_table_t *block, *peer_block; + const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + uint16_t block_index; + uint16_t num_of_blocks; + ib_api_status_t status = IB_SUCCESS; + + p = osm_port_get_default_phys_ptr(p_port); + if (!osm_physp_is_valid(p)) + return; + peer = osm_physp_get_remote(p); + if (!peer || !osm_physp_is_valid(peer)) + return; + p_node = osm_physp_get_node_ptr(peer); + if (osm_node_get_type(p_node) == IB_NODE_TYPE_CA) + return; + + p_pkey_tbl = osm_physp_get_pkey_tbl(p); + 
p_peer_pkey_tbl = osm_physp_get_pkey_tbl(peer); + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_pkey_tbl); + if (num_of_blocks > osm_pkey_tbl_get_num_blocks(p_peer_pkey_tbl)) + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_peer_pkey_tbl); + + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); + if (cl_memcmp(peer_block, block, sizeof(*block))) { + cl_memcpy(peer_block, block, sizeof(*block)); + status = osm_pkey_mgr_update_pkey_entry(p_mgr, peer, peer_block, block_index); + if (status != IB_SUCCESS) + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_pkey_mgr_update_peer_port: " + "osm_pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " and port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - port_num ); - } + block_index, + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(peer)); } } - else + + if ( num_of_blocks && status == IB_SUCCESS && + osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) { - /* default key or partial default key already exist */ - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_pkey_mgr_update_peer_port: " + "pkey table was updated for node 0x%016" PRIx64 + " and port %u\n", + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(peer)); + } +} + +/********************************************************************** + **********************************************************************/ +static boolean_t osm_pkey_mgr_process_partition_table( + const osm_pkey_mgr_t * const p_mgr, + const osm_prtn_t * const p_prtn, + const boolean_t full) +{ + const cl_map_t * p_tbl = full ? 
+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + boolean_t result = FALSE; + + if (full) + pkey = cl_hton16(cl_ntoh16(pkey)|0x8000); + + i_next = cl_map_head(p_tbl); + while (i_next != cl_map_end(p_tbl)) { + i = i_next; + i_next = cl_map_next(i); + p_physp = cl_map_obj(i); + if (p_physp && osm_physp_is_valid(p_physp) && + __osm_pkey_mgr_process_physical_port(p_mgr, pkey, p_physp)) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " - "No need to insert IB_DEFAULT_PKEY for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + result = TRUE; + if (osm_log_is_active(p_mgr->p_log, OSM_LOG_VERBOSE)) + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_pkey_mgr_process_partition_table: " + "Adding %04x for pkey table of node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh16(pkey), + cl_ntoh64(osm_node_get_node_guid( + osm_physp_get_node_ptr(p_physp))), + osm_physp_get_port_num(p_physp)); } } - OSM_LOG_EXIT( p_mgr->p_log ); - return ( return_val ); + return result; } /********************************************************************** @@ -230,51 +364,54 @@ osm_signal_t osm_pkey_mgr_process( IN const osm_pkey_mgr_t * const p_mgr ) { - cl_qmap_t *p_node_guid_tbl; - osm_node_t *p_node; - osm_node_t *p_next_node; - uint8_t port_num; - osm_physp_t *p_physp; + cl_qmap_t *p_tbl; + cl_map_item_t *p_next; + osm_prtn_t *p_prtn; + osm_port_t *p_port; osm_signal_t result = OSM_SIGNAL_DONE; CL_ASSERT( p_mgr ); OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); - p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; + CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock); - CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + if (osm_prtn_make_partitions(p_mgr->p_log, p_mgr->p_subn) != IB_SUCCESS) + { + osm_log(p_mgr->p_log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " + "osm_prtn_make_partitions() is failed.\n"); + goto _err; + } - p_next_node = ( 
osm_node_t * ) cl_qmap_head( p_node_guid_tbl ); - while ( p_next_node != ( osm_node_t * ) cl_qmap_end( p_node_guid_tbl ) ) + p_tbl = &p_mgr->p_subn->prtn_pkey_tbl; + + p_next = cl_qmap_head(p_tbl); + while (p_next != cl_qmap_end(p_tbl)) { - p_node = p_next_node; - p_next_node = ( osm_node_t * ) cl_qmap_next( &p_next_node->map_item ); + p_prtn = (osm_prtn_t *)p_next; + p_next = cl_qmap_next(p_next); - for ( port_num = 0; port_num < osm_node_get_num_physp( p_node ); - port_num++ ) - { - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - if ( osm_physp_is_valid( p_physp ) ) - { - if ( __osm_pkey_mgr_process_physical_port - ( p_mgr, p_node, port_num, p_physp ) ) - { - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_pkey_mgr_process: " - "Adding IB_DEFAULT_PKEY for pkey table of node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - port_num ); - } - result = OSM_SIGNAL_DONE_PENDING; - } - } + if (osm_pkey_mgr_process_partition_table(p_mgr, p_prtn, FALSE)) + result = OSM_SIGNAL_DONE_PENDING; + if (osm_pkey_mgr_process_partition_table(p_mgr, p_prtn, TRUE)) + result = OSM_SIGNAL_DONE_PENDING; + } + + p_tbl = &p_mgr->p_subn->port_guid_tbl; + + p_next = cl_qmap_head(p_tbl); + while (p_next != cl_qmap_end(p_tbl)) + { + p_port = (osm_port_t *)p_next; + p_next = cl_qmap_next(p_next); + + if (osm_node_get_type(osm_port_get_parent_node(p_port)) != + IB_NODE_TYPE_SWITCH) { + osm_pkey_mgr_update_peer_port(p_mgr, p_port); } } + _err: CL_PLOCK_RELEASE( p_mgr->p_lock ); OSM_LOG_EXIT( p_mgr->p_log ); return ( result ); diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c new file mode 100644 index 0000000..f5f3a32 --- /dev/null +++ b/osm/opensm/osm_prtn.c @@ -0,0 +1,324 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +/* + * Abstract: + * Implementation of osm_prtn_t. + * This object represents an IBA partition. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + * $Revision$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include + +#include +#include +#include +#include +#include +#include + + +extern int osm_prtn_config_parse_file(osm_log_t * const p_log, + osm_subn_t * const p_subn, + const char *file_name); + + +static uint16_t global_pkey_counter; + +/* + * + */ + +osm_prtn_t* osm_prtn_new( + IN const char *name, + IN const uint16_t pkey ) +{ + osm_prtn_t *p = cl_zalloc(sizeof(*p)); + if (!p) + return NULL; + p->pkey = pkey; + cl_map_construct(&p->full_guid_tbl); + cl_map_init(&p->full_guid_tbl, 32); + cl_map_construct(&p->part_guid_tbl); + cl_map_init(&p->part_guid_tbl, 32); + + if (name && *name) + strncpy(p->name, name, sizeof(p->name)); + else + snprintf(p->name, sizeof(p->name), "%04x", cl_ntoh16(pkey)); + + return p; +} + +void osm_prtn_delete( + IN OUT osm_prtn_t** const pp_prtn ) +{ + osm_prtn_t *p = *pp_prtn; + cl_map_remove_all(&p->full_guid_tbl); + cl_map_destroy(&p->full_guid_tbl); + cl_map_remove_all(&p->part_guid_tbl); + cl_map_destroy(&p->part_guid_tbl); + cl_free(p); + *pp_prtn = NULL; +} + + +ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn, + osm_prtn_t *p, ib_net64_t guid, boolean_t full) +{ + cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; + ib_api_status_t status = IB_SUCCESS; + cl_map_t *p_tbl; + osm_port_t *p_port; + osm_physp_t *p_physp; + + p_port = (osm_port_t *)cl_qmap_get(p_port_tbl, guid); + if (!p_port || p_port == (osm_port_t *)cl_qmap_end(p_port_tbl)) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "port 0x%" PRIx64 " not found.\n", + cl_ntoh64(guid)); + return status; + } + + p_physp = osm_port_get_default_phys_ptr(p_port); + if (!p_physp) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "no physical for port 0x%" PRIx64 "\n", + cl_ntoh64(guid)); + return status; + } + + if (osm_prtn_is_guid(p, guid)) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + 
"port 0x%" PRIx64 " already in " + "partition \'%s\' (%04x). Will overwrite.\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey)); + } + + p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) + return IB_INSUFFICIENT_MEMORY; + + return status; +} + + +ib_api_status_t osm_prtn_add_all(osm_log_t *p_log, osm_subn_t *p_subn, + osm_prtn_t *p, boolean_t full) +{ + cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; + cl_map_item_t *p_item; + osm_port_t *p_port; + ib_api_status_t status = IB_SUCCESS; + + p_item = cl_qmap_head(p_port_tbl); + while (p_item != cl_qmap_end(p_port_tbl)) { + p_port = (osm_port_t *)p_item; + p_item = cl_qmap_next(p_item); + status = osm_prtn_add_port(p_log, p_subn, p, + osm_port_get_guid(p_port), full); + if (status != IB_SUCCESS) + goto _err; + } + + _err: + return status; +} + + +ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p) +{ + ib_member_rec_t mc_rec; + ib_net64_t comp_mask; + ib_net16_t pkey; + osm_mgrp_t *p_mgrp = NULL; + osm_sa_t *p_sa = &p_subn->p_osm->sa; + ib_api_status_t status = IB_SUCCESS; + + pkey = cl_hton16(cl_ntoh16(p->pkey)|0x8000); + + cl_memclr(&mc_rec, sizeof(mc_rec)); + + mc_rec.mgid = osm_ipoib_mgid; /* this is ipv4 broadcast */ + cl_memcpy(&mc_rec.mgid.raw[4], &pkey, sizeof(pkey)); + + mc_rec.qkey = CL_HTON32(0x0b1b); + mc_rec.mtu = 4; /* 2048 Bytes */ + mc_rec.tclass = 0; + mc_rec.pkey = pkey; + mc_rec.rate = 0x3; /* 10Gb/sec */ + mc_rec.pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + mc_rec.sl_flow_hop = OSM_DEFAULT_SL << 28; + /* Note: scope needs to be consistent with MGID */ + mc_rec.scope_state = 0x21; + + /* mtu and rate will be updated according to CA */ + comp_mask = 0; + + status = osm_mcmr_rcv_find_or_create_new_mgrp(&p_sa->mcmr_rcv, + comp_mask, &mc_rec, &p_mgrp); + + if (!p_mgrp || status != IB_SUCCESS) + osm_log( p_log, OSM_LOG_ERROR, + "osm_prtn_add_mcgroup:" + " failed to create mc group with %04x pkey\n", + 
cl_ntoh16(pkey)); + + return status; +} + + +static uint16_t __generate_pkey(osm_subn_t *p_subn) +{ + uint16_t pkey; + cl_qmap_t *m = &p_subn->prtn_pkey_tbl; + while ( global_pkey_counter < IB_DEFAULT_PARTIAL_PKEY - 1) { + pkey = ++global_pkey_counter; + if (cl_qmap_get(m, pkey) == cl_qmap_end(m)) + return cl_hton16(pkey); + } + return 0; +} + +osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, + const char *name, uint16_t pkey) +{ + osm_prtn_t *p = NULL, *p_check; + + if (pkey == 0 && !(pkey = __generate_pkey(p_subn))) + return NULL; + + if (cl_ntoh16(pkey)&0x8000) { + pkey = cl_hton16(cl_ntoh16(pkey)&~0x8000); + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_new: pkey was striped for" + " partition \'%s\' (%04x)\n", + name, cl_ntoh16(pkey)); + } + + p = osm_prtn_new(name, pkey); + if (!p) { + osm_log(p_log, OSM_LOG_ERROR, + "osm_prtn_make_new: Unable to create" + " partition \'%s\' (%04x)\n", + name, cl_ntoh16(pkey)); + return NULL; + } + + p_check = (osm_prtn_t *)cl_qmap_insert(&p_subn->prtn_pkey_tbl, + p->pkey, &p->map_item); + if (p != p_check) { + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_new: Duplicated partition" + " definition: \'%s\' (%04x) prev name \'%s\'" + " - will use it.\n", + name, cl_ntoh16(pkey), p_check->name); + osm_prtn_delete(&p); + p = p_check; + } + + return p; +} + + +static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, + osm_subn_t * const p_subn) +{ + ib_api_status_t status = IB_UNKNOWN_ERROR; + osm_prtn_t *p; + p = osm_prtn_make_new(p_log, p_subn, + "Default", IB_DEFAULT_PARTIAL_PKEY); + if (!p) + goto _err; + status = osm_prtn_add_all(p_log, p_subn, p, FALSE); + if (status != IB_SUCCESS) + goto _err;; + cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); + status = osm_prtn_add_port(p_log, p_subn, p, + p_subn->sm_port_guid, TRUE); + _err: + return status; +} + + +ib_api_status_t osm_prtn_make_partitions(osm_log_t * const p_log, + osm_subn_t * const p_subn) +{ + const char *file_name; + 
ib_api_status_t status = IB_SUCCESS; + osm_prtn_t *p, *p_next; + + file_name = p_subn->opt.partition_config_file ? + p_subn->opt.partition_config_file : + "/etc/osm-partitions.conf"; + + /* cl_qmap uses self addresses we cannot just save + qmap state and clean it later, so clean all now */ + p_next = (osm_prtn_t *)cl_qmap_head(&p_subn->prtn_pkey_tbl); + while (p_next != (osm_prtn_t *)cl_qmap_end(&p_subn->prtn_pkey_tbl)) { + p = p_next; + p_next = (osm_prtn_t *)cl_qmap_next(&p->map_item); + osm_prtn_delete(&p); + } + cl_qmap_init(&p_subn->prtn_pkey_tbl); + + global_pkey_counter = 0; + + status = osm_prtn_make_default(p_log, p_subn); + if(status != IB_SUCCESS) + goto _err; + + if (osm_prtn_config_parse_file(p_log, p_subn, file_name)) { + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_partitions: Partition configuration " + "file \'%s\' was not fully processed (or does not exist).\n", + file_name); + } + + _err: + return status; +} diff --git a/osm/opensm/osm_prtn_config.c b/osm/opensm/osm_prtn_config.c new file mode 100644 index 0000000..97a835a --- /dev/null +++ b/osm/opensm/osm_prtn_config.c @@ -0,0 +1,387 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +/* + * Abstract: + * Implementation of opensm partition management configuration + * + * Environment: + * Linux User Mode + * + * $Revision$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include + +#include +#include +#include +#include + + +#if __WORDSIZE == 64 +#define STRTO_IB_NET64(str, end, base) strtoul(str, end, base) +#else +#define STRTO_IB_NET64(str, end, base) strtoull(str, end, base) +#endif + +#define PARSERR(log, lnum, fmt, arg...) { \ + osm_log(log, OSM_LOG_ERROR, \ + "\nPARSE ERROR: line %d: " fmt "\n", (lnum), ##arg ); \ + fprintf(stderr, \ + "\nPARSE ERROR: line %d: " fmt "\n", (lnum), ##arg ); \ +} + +#define PARSEWARN(log, lnum, fmt, arg...) 
\ + osm_log(log, OSM_LOG_VERBOSE, \ + "PARSE WARN: line %d: " fmt , (lnum), ##arg ) + +/* + */ + + +struct part_conf { + osm_log_t *p_log; + osm_subn_t *p_subn; + osm_prtn_t *p_prtn; +}; + + +extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, + const char *name, uint16_t pkey); +extern ib_api_status_t osm_prtn_add_all(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p, boolean_t full); +extern ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p, ib_net64_t guid, + boolean_t full); +extern ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p); + + +static int partition_create(unsigned lineno, struct part_conf *conf, + char *name, char *id, char *flag, char *flag_val) +{ + uint16_t pkey; + + if (!id && name && isdigit(*name)) { + id = name; + name = NULL; + } + + if (id) { + char *end; + pkey = strtoul(id, &end, 0); + if (end == id || *end) + return -1; + } + else + pkey = 0; + + conf->p_prtn = osm_prtn_make_new(conf->p_log, conf->p_subn, + name, cl_hton16(pkey)); + if (!conf->p_prtn) + return -1; + + if (flag) { + if(!strncmp(flag, "ipoib", strlen(flag))) + osm_prtn_add_mcgroup(conf->p_log, + conf->p_subn, conf->p_prtn); + else { + PARSEWARN(conf->p_log, lineno, + "unrecognized partition flag \'%s\'" + " - ignored.\n", flag); + } + } + + return 0; +} + + +static int partition_add_port(unsigned lineno, struct part_conf *conf, + char *name, char *flag) +{ + osm_prtn_t *p = conf->p_prtn; + ib_net64_t guid; + boolean_t full = FALSE; + + if (!name || !*name || !strncmp(name, "NONE", strlen(name))) + return 0; + + if (flag) { + if(!strncmp(flag, "full", strlen(flag))) + full = TRUE; + else if(strncmp(flag, "partial", strlen(flag))) { + PARSEWARN(conf->p_log, lineno, + "unrecognized port flag \'%s\' -" + " suppose \'partial\'\n", flag); + } + } + + if (!strncmp(name, "ALL", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + full) == IB_SUCCESS ? 
0 : -1; + } + else if (!strncmp(name, "SELF", strlen(name))) { + guid = cl_ntoh64(conf->p_subn->sm_port_guid); + } + else { + char *end; + guid = STRTO_IB_NET64(name, &end, 0); + if (!guid || *end) + return -1; + } + + if (osm_prtn_add_port(conf->p_log, conf->p_subn, p, + cl_hton64(guid), full) != IB_SUCCESS) + return -1; + + return 0; +} + + +/* conf file parser */ + +#define STRIP_HEAD_SPACES(p) while (*(p) == ' ' || *(p) == '\t' || \ + *(p) == '\n') { (p)++; } +#define STRIP_TAIL_SPACES(p) { char *q = (p) + strlen(p); \ + while ( q != (p) && ( *q == '\0' || \ + *q == ' ' || *q == '\t' || \ + *q == '\n')) { *q-- = '\0'; }; } + +static int parse_name_token(char *str, char **name, char **val) +{ + int len = 0; + char *p, *q; + + *name = *val = NULL; + + p = str; + + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + + q = strchr(p, '='); + if (q) + *q++ = '\0'; + + len = strlen(str) + 1; + str = q; + + q = p + strlen(p); + while ( q != p && + ( *q == '\0' || *q == ' ' || *q == '\t' || *q == '\n')) + *q-- = '\0'; + + *name = p; + + p = str; + if (!p) + return len; + + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + + q = p + strlen(p); + len += q - str + 1; + while ( q != p && + ( *q == '\0' || *q == ' ' || *q == '\t' || *q == '\n')) + *q-- = '\0'; + *val = p; + + return len; +} + + +static struct part_conf *new_part_conf(osm_log_t *p_log, osm_subn_t *p_subn) +{ + static struct part_conf part; + struct part_conf *conf = &part; + memset(conf, 0, sizeof(*conf)); + conf->p_log = p_log; + conf->p_subn = p_subn; + conf->p_prtn = NULL; + return conf; +} + +static int flush_part_conf(struct part_conf *conf) +{ + memset(conf, 0, sizeof(*conf)); + return 0; +} + + +static int parse_part_conf(struct part_conf *conf, char *str, int lineno) +{ + int ret, len = 0; + char *name, *id, *flag, *flval; + char *q, *p; + + p = str; + if (*p == '\t' || *p == '\0' || *p == '\n') + p++; + + len += p - str; + str = p; + + if (conf->p_prtn) + goto skip_header; + + q = strchr(p, ':');
+ if (!q) { + PARSERR(conf->p_log, lineno, + "no partition definition found\n"); + return -1; + } + + *q++ = '\0'; + str = q; + + name = id = flag = flval = NULL; + + q = strchr(p, ','); + if (q) + *q = '\0'; + + ret = parse_name_token(p, &name, &id); + p += ret; + len += ret; + + if (q) { + ret = parse_name_token(p, &flag, &flval); + if (!flag) { + PARSERR(conf->p_log, lineno, + "bad partition flags\n"); + return -1; + } + p += ret; + len += ret; + } + + if (p != str || (partition_create(lineno, conf, + name, id, flag, flval) < 0)) { + PARSERR(conf->p_log, lineno, + "bad partition definition\n"); + return -1; + } + + skip_header: + do { + name = flag = NULL; + q = strchr(p, ','); + if (q) + *q++ = '\0'; + ret = parse_name_token(p, &name, &flag); + if (partition_add_port(lineno, conf, name, flag) < 0) { + PARSERR(conf->p_log, lineno, + "bad PortGUID\n"); + return -1; + } + p += ret; + len += ret; + } while (q); + + return len; +} + +int osm_prtn_config_parse_file(osm_log_t *p_log, osm_subn_t *p_subn, + const char *file_name) +{ + char line[1024]; + struct part_conf *conf = NULL; + FILE *file; + int lineno; + + file = fopen(file_name, "r"); + if (!file) { + perror("fopen"); + return -1; + } + + lineno = 0; + + while (fgets(line, sizeof(line) - 1, file) != NULL) { + char *q, *p = line; + + lineno++; + + p = line; + + q = strchr(p, '#'); + if (q) + *q = '\0'; + + do { + int len; + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + if (*p == '\0') + break; + + if (!conf && + !(conf = new_part_conf(p_log, p_subn))) { + PARSERR(p_log, lineno, + "internal: cannot create config.\n"); + break; + } + + q = strchr(p, ';'); + if (q) + *q = '\0'; + + len = parse_part_conf(conf, p, lineno); + if (len < 0) { + break; + } + + p += len; + + if (q) { + flush_part_conf(conf); + conf = NULL; + } + } while (q); + } + + fclose(file); + + return 0; +} diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index ef63c15..22fe7dc 100644 --- 
a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -1146,6 +1146,24 @@ __mgrp_request_is_realizable( } /********************************************************************** + Call this function to find or create a new mgrp. +**********************************************************************/ +ib_api_status_t +osm_mcmr_rcv_find_or_create_new_mgrp( + IN osm_mcmr_recv_t* const p_rcv, + IN ib_net64_t comp_mask, + IN ib_member_rec_t* const p_recvd_mcmember_rec, + OUT osm_mgrp_t **pp_mgrp) +{ + ib_api_status_t status; + status = __get_mgrp_by_mgid(p_rcv, p_recvd_mcmember_rec, pp_mgrp); + if (status == IB_SUCCESS) + return status; + return osm_mcmr_rcv_create_new_mgrp(p_rcv, comp_mask, + p_recvd_mcmember_rec, pp_mgrp); +} + +/********************************************************************** Call this function to create a new mgrp. **********************************************************************/ ib_api_status_t diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 1340017..1068435 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -53,11 +53,13 @@ #include #include +#include #include #include #include #include #include +#include #include #include #include @@ -97,6 +99,7 @@ osm_subn_destroy( osm_port_t *p_port, *p_next_port; osm_switch_t *p_sw, *p_next_sw; osm_remote_sm_t *p_rsm, *p_next_rsm; + osm_prtn_t *p_prtn, *p_next_prtn; osm_mgrp_t *p_mgrp, *p_next_mgrp; osm_infr_t *p_infr, *p_next_infr; @@ -135,6 +138,14 @@ osm_subn_destroy( cl_free( p_rsm ); } + p_next_prtn = (osm_prtn_t*)cl_qmap_head( &p_subn->prtn_pkey_tbl ); + while( p_next_prtn != (osm_prtn_t*)cl_qmap_end( &p_subn->prtn_pkey_tbl ) ) + { + p_prtn = p_next_prtn; + p_next_prtn = (osm_prtn_t*)cl_qmap_next( &p_prtn->map_item ); + osm_prtn_delete( &p_prtn ); + } + p_next_mgrp = (osm_mgrp_t*)cl_qmap_head( &p_subn->mgrp_mlid_tbl ); while( p_next_mgrp != (osm_mgrp_t*)cl_qmap_end( &p_subn->mgrp_mlid_tbl ) ) { @@ -167,10 +178,13 @@ osm_subn_destroy( ib_api_status_t osm_subn_init( IN osm_subn_t* const p_subn, + IN osm_opensm_t * const p_osm, IN const osm_subn_opt_t* const p_opt ) { cl_status_t status; + p_subn->p_osm = p_osm; + status = cl_ptr_vector_init( &p_subn->node_lid_tbl, OSM_SUBNET_VECTOR_MIN_SIZE, OSM_SUBNET_VECTOR_GROW_SIZE ); @@ -428,6 +442,7 @@ osm_subn_set_default_opt( p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE; p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; p_opt->max_port_profile = 0xffffffff; From ianjiang.ict at gmail.com Sun Feb 19 08:39:05 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Mon, 20 Feb 2006 00:39:05 +0800 Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4AA9@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4AA9@mtlexch01.mtl.com> Message-ID: <7b2fa1820602190839p1ac27dbayaebd5667a167776e@mail.gmail.com> Hi Dotan, Thanks for your reply! 
I used a non-zero value for min_rnr_timer like this: qp_attr.min_rnr_timer = /*0*/ IB_RNR_NAK_TIMER_2_56; To keep things simple, I put the sender and the receiver in the same process rather than synchronizing two processes. The data was transferred correctly, as confirmed by printing out the contents of the buffers. However, I have another question: the "number of bytes transferred" of the CQE VAPI_CQE_RQ_SEND_DATA was the same as that of the Recv request. But for the CQE VAPI_CQE_SQ_SEND_DATA it was zero, which was not what I expected. What do you think is the matter? Thanks very much! Here is the relevant code: ===========print the WC descriptor ======== void print_wc_desc(VAPI_wc_desc_t *wc_desc_p) { if (wc_desc_p != NULL) { printf("status: %d\n", wc_desc_p->status); printf("id: %lu\n", wc_desc_p->id); printf("opcode: %s\n", get_cqe_opcode_str(wc_desc_p->opcode)); printf("Num. of bytes transferred: %d\n", wc_desc_p->byte_len); printf("..."); } } ======= wait the Send to complete ========= do { poll_cnt++; res = VAPI_poll_cq(hca_hndl, s_cq_hndl, &wc_desc); if (res != VAPI_OK && res != VAPI_CQ_EMPTY) { PRINT_ERR("Poll CQ block failed\n"); VAPIERR(res); return -1; } } while(res == VAPI_CQ_EMPTY && poll_cnt < 10); if (wc_desc.status != VAPI_SUCCESS) { PRINT_ERR("Req unsuccess: %s\n", VAPI_wc_status_sym(wc_desc.status)); print_wc_desc(&wc_desc); return -1; } PRINT_TRACE("Req success\n"); print_wc_desc(&wc_desc); On 2/19/06, Dotan Barak wrote: > > I believe that the problem is that the min_rnr_timer value is 0 (which > means infinite timeout between the rnr retries) and there is rnr nak between the two sides (because you don't sync between the sides, and this is > the reason for the empty CQ … > > > Let me describe the problem: > > The sender sent a send message which should consume RR (Receive Request) > at the receiver side, but when the message have reached to the receiver > there wasn't any RR in the RQ, so he sent to the sender rnr-nack, the > sender got the 
rnr-nack and is waiting the min_rnr_timer which is infinite > … > > > > You should do the following things: > > - Put a non zero value in the min_rnr_timer (you may get completion > with bad status: rnr exceeded if the receiver won't be ready in time > …) > - Post RR in the responder in the init state > - Optional: sync between the sides (post SR at the sender only when > there is RR in the receiver side). > > > > Dotan > > -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: (Michael S. Tsirkin's message of "Sun, 19 Feb 2006 15:31:56 +0200") References: <20060219133156.GF22037@mellanox.co.il> Message-ID: Michael> I think the claim is that it's awkward to use different Michael> tools to get different components. Related question: can Michael> git support working on just the infiniband subdirectory Michael> of a tree? Thats important e.g. for people that only Michael> want to work on the infiniband stuff against the last Michael> stable kernel. 
I'm not sure exactly what you want to do. In general a distributed SCM like git is much better for pulling changes from one tree to another. So for example, it's easy to clone a kernel tree, check out the v2.6.15 tag, and then pull from my infiniband.git tree to get only the IB changes in a given branch. - R. From rdreier at cisco.com Sun Feb 19 14:17:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Feb 2006 14:17:51 -0800 Subject: [openib-general] Re: ibv_create_srq doesn't update the SRQ init attributes In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4D5E@mtlexch01.mtl.com> (Dotan Barak's message of "Sun, 19 Feb 2006 18:26:40 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4D5E@mtlexch01.mtl.com> Message-ID: Dotan> Hi. When one create a QP, the init attributes structure is Dotan> being changed by the verb to the actual init attributes Dotan> that were used for this QP. When one create a SRQ, the Dotan> init attributes structure is not being changed, BUT a Dotan> query_srq will return the actual values that were used to Dotan> create the SRQ (those values may be different from the Dotan> values that the user put in the create_srq). Dotan> Should the create_srq return the values that were used to Dotan> create the SRQ in the init_attributes? Yes, it would make sense to update the init attributes in ibv_create_srq() the same way that ibv_create_qp() does. Care to write a patch? 
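To illustrate the behaviour being proposed, here is a minimal sketch in C. The demo_* types and functions are hypothetical stand-ins, not the libibverbs or mthca definitions, and the power-of-two rounding is only an assumed example of how a provider might adjust the requested size. The point is simply that the create call writes the values it actually used back into the caller's init attributes, so no separate query is needed afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real verbs structures. */
struct demo_srq_init_attr {
    uint32_t max_wr;   /* in: requested work requests, out: actual */
    uint32_t max_sge;  /* in: requested scatter entries, out: actual */
};

struct demo_srq {
    uint32_t max;      /* entries actually allocated */
    uint32_t max_gs;   /* scatter entries actually supported per request */
};

enum { DEMO_MAX_WR = 1 << 16 };  /* assumed device limit, for illustration */

/* Round v up to the next power of two (v is at most DEMO_MAX_WR here). */
static uint32_t demo_round_up_pow2(uint32_t v)
{
    uint32_t r = 1;

    while (r < v)
        r <<= 1;
    return r;
}

/*
 * Create the queue, then report the sizes that were really used back
 * through attr, the way the thread says create_qp already does for QPs.
 */
static int demo_create_srq(struct demo_srq *srq,
                           struct demo_srq_init_attr *attr)
{
    assert(srq != NULL && attr != NULL);

    if (attr->max_wr == 0 || attr->max_wr > DEMO_MAX_WR || attr->max_sge == 0)
        return -1;

    srq->max = demo_round_up_pow2(attr->max_wr);  /* assumed rounding policy */
    srq->max_gs = attr->max_sge;

    /* Write the actual values back to the caller. */
    attr->max_wr = srq->max;
    attr->max_sge = srq->max_gs;
    return 0;
}
```

With this shape, a caller that asks for 100 work requests can read attr.max_wr after the call and learn that 128 were allocated.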
Thanks, Roland From rdreier at cisco.com Sun Feb 19 14:28:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Feb 2006 14:28:59 -0800 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <7.0.1.0.2.20060219093506.040ee028@netapp.com> (Thomas Talpey's message of "Sun, 19 Feb 2006 09:56:18 -0500") References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> Message-ID: Christoph> - structures like rpcrdma_msg that are on the wire Christoph> should use __be* for endianess annotations, and the Christoph> cpu_to_be*/be*_to_cpu accessor functions instead of Christoph> hton?/ntoh?. Please verify that these annotations are Christoph> correct using sparse -D__CHECK_ENDIAN__=1 Thomas> Hmm, okay but existing RPC and NFS code don't do this. I'm Thomas> reluctant to differ from the style of the governing Thomas> subsystem. I'll check w/Trond. I'm sure that's because the rest of the code was written before sparse and the __beXX types. I agree with Christoph -- if you don't annotate your code for endianness, then someone else is going to have to come back and clean it up later. So please do it up front. Christoph> - rdma_convert_physiov/rdma_convert_phys are completely Christoph> broken. page_to_phys can't be used by driver/fs code. Christoph> RDMA only deals with bus addresses, not physical Christoph> addresses. You must use the dma mapping API Christoph> instead. Also coalescing decisions are made by the dma Christoph> layer, because they are platform dependent and much Christoph> more complex then what the code in this patch does. Thomas> Now that we are moving to OpenIB api's this is Thomas> needed. There is some thought necessary w.r.t. our Thomas> max-performance mode of preregistering memory in DMA Thomas> mode. That's on our list of course. Again let me echo Christoph's point. 
If you are passing physical addresses into IB functions, then your code simply won't work on some architectures. Making sure your code actually works on something like a ppc64 box with an IOMMU would be a good test -- the low-end IBM POWER machines are cheap enough that you could just buy one if you don't have easy access. - R. From mst at mellanox.co.il Sun Feb 19 14:35:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Feb 2006 00:35:15 +0200 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: References: <20060219133156.GF22037@mellanox.co.il> Message-ID: <20060219223514.GA24182@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Plans for libibverbs 1.0, 1.1 and beyond > > Michael> I think the claim is that it's awkward to use different > Michael> tools to get different components. Related question: can > Michael> git support working on just the infiniband subdirectory > Michael> of a tree? Thats important e.g. for people that only > Michael> want to work on the infiniband stuff against the last > Michael> stable kernel. > > I'm not sure exactly what you want to do. In general a distributed > SCM like git is much better for pulling changes from one tree to > another. So for example, it's easy to clone a kernel tree, check out > the v2.6.15 tag, and then pull from my infiniband.git tree to get only > the IB changes in a given branch. So, each component gets a separate git tree? -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From Thomas.Talpey at netapp.com Sun Feb 19 14:56:41 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Sun, 19 Feb 2006 17:56:41 -0500 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> Message-ID: <7.0.1.0.2.20060219174756.045d6290@netapp.com> At 05:28 PM 2/19/2006, Roland Dreier wrote: > Christoph> - rdma_convert_physiov/rdma_convert_phys are completely > Christoph> broken. page_to_phys can't be used by driver/fs code. > Christoph> RDMA only deals with bus addresses, not physical > Christoph> addresses. You must use the dma mapping API > Christoph> instead. Also coalescing decisions are made by the dma > Christoph> layer, because they are platform dependent and much > Christoph> more complex then what the code in this patch does. > > Thomas> Now that we are moving to OpenIB api's this is > Thomas> needed. There is some thought necessary w.r.t. our > Thomas> max-performance mode of preregistering memory in DMA > Thomas> mode. That's on our list of course. > >Again let me echo Christoph's point. If you are passing physical >addresses into IB functions, then your code simply won't work on some >architectures. Making sure your code actually works on something like >a ppc64 box with an IOMMU would be a good test -- the low-end IBM >POWER machines are cheap enough that you could just buy one if you >don't have easy access. Yep, I get it! To elaborate a little, we're not exactly passing physical addresses. What we're doing is using the physaddr to calculate an offset relative to a base of zero. We register the zero address and advertise RDMA buffers via offsets relative to that r_key. And, this is only one of many memory registration modes. We would use memory windows, if only OpenIB provided them (yes I know the hardware currently sucks for them). 
We will add FMR support shortly. In both these modes we perform all addressing by the book via 1-1 OpenIB registration. Tom. From rdreier at cisco.com Sun Feb 19 20:15:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Feb 2006 20:15:00 -0800 Subject: [openib-general] Re: Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <20060219223514.GA24182@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Feb 2006 00:35:15 +0200") References: <20060219133156.GF22037@mellanox.co.il> <20060219223514.GA24182@mellanox.co.il> Message-ID: Michael> So, each component gets a separate git tree? It would make sense to me to have each userspace component in a separate git tree. For the kernel, one git tree would make the most sense. But of course there can be many branches "in flight" at the same time. - R. From dotanb at mellanox.co.il Sun Feb 19 23:08:37 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 20 Feb 2006 09:08:37 +0200 Subject: [openib-general] [VAPI]VAPI_poll_cq: CQ is empty Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4DEB@mtlexch01.mtl.com> Hi ian. Hi Dotan, > However I got another question: the "number of bytes transferred" of the CQE VAPI_CQE_RQ_SEND_DATA was the same as the number of the Recv >request. But the one of the CQE VAPI_CQE_SQ_SEND_DATA was zero, which was not what I expected. What is the matter do you think? >Thanks very much! The completion structure is a little bit tricky; not all of the attributes are valid: it depends on the opcode that was used + QP transport type + QP side (requestor/responder) + the completion status. You should check the IB spec (section 11.4.2.1) for the valid attributes. Anyway, in the requestor side if opcode SEND (with immediate) was used, the byte len attribute is not valid ... Dotan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dotanb at mellanox.co.il Mon Feb 20 00:47:10 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 20 Feb 2006 10:47:10 +0200 Subject: [openib-general] RE: ibv_create_srq doesn't update the SRQ init attributes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4E72@mtlexch01.mtl.com> > Dotan> Should the create_srq return the values that were used to > Dotan> create the SRQ in the init_attributes? > > Yes, it would make sense to update the init attributes in > ibv_create_srq() the same way that ibv_create_qp() does. Care to > write a patch? No problem, but I don't know what to change ... Can you explain to me the concept of the ABI versioning ? Thankx Dotan From glebn at voltaire.com Mon Feb 20 01:03:23 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 20 Feb 2006 11:03:23 +0200 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060215082331.GE10026@mellanox.co.il> References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> Message-ID: <20060220090323.GA24524@minantech.com> On Wed, Feb 15, 2006 at 10:23:31AM +0200, Michael S. Tsirkin wrote: > Quoting r. Gleb Natapov : > > Subject: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK > > > > On Tue, Feb 14, 2006 at 06:42:36PM +0200, Michael S. Tsirkin wrote: > > > All, MADV_DONTFORK patch is now part of the -mm tree. > > > Everyone who's interested in fork support, please test 2.6.16-rc3-mm1 and > > > publish the results here and on lkml. > > > > > Good news! > > > > Should call to madvise be the part of reg_mr call? > > Probably no - MPI should have to do it. > It had come to my attention that there is file memory.c in libibverbs that implements refcounting for mlock. 
I think it was meant to be used from reg_mr() (since interface is hidden) back when mlock was needed for kernel bug workaround. If it was good idea back than why not now? Alternatively we can make this interface public for application to use explicitly. Roland can you please tell the history behind memory.c? -- Gleb. From dotanb at mellanox.co.il Mon Feb 20 01:52:07 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 20 Feb 2006 11:52:07 +0200 Subject: [openib-general] mthca fix: update the init attributes in create_srq Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4EB8@mtlexch01.mtl.com> The init attributes are being updated with the actual SRQ values in the mthca. Signed-off-by: Dotan Barak Index: latest/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_srq.c 2006-02-20 08:04:06.000000000 +0200 +++ latest/drivers/infiniband/hw/mthca/mthca_srq.c 2006-02-20 11:26:04.000000000 +0200 @@ -268,6 +268,9 @@ int mthca_alloc_srq(struct mthca_dev *de srq->first_free = 0; srq->last_free = srq->max - 1; + attr->max_wr = srq->max; + attr->max_sge = srq->max_gs; + return 0; err_out_free_srq: Dotan Barak Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eli at mellanox.co.il Mon Feb 20 02:53:04 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 20 Feb 2006 12:53:04 +0200 Subject: [openib-general] [PATCH] mthca - command interface Message-ID: <1140432784.4429.14.camel@mtls03.yok.mtl.com> This patch checks whether the HCA supports posting commands through doorbells and, if it does, configures the driver accordingly. This functionality can be controlled by the value of a read/write parameter, post_cmd2dbell. When it is 0, commands are posted through the HCR; otherwise, if the HCA is capable, commands go through the doorbell. The value of the parameter applies collectively to all available HCAs. This use of UAR0 to post commands eliminates the need for polling the go bit prior to posting a new command. Since reading from a PCI device is much more expensive than issuing a posted write, it is expected that this change will provide better CPU utilization. We are currently developing such tests, and once we have results I will post them to this list. 
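The selection rule described above can be sketched roughly as follows. This is a user-space illustration only, not the driver code: the demo_* names are hypothetical, the bit positions (DEMO_GO_BIT, DEMO_E_BIT, DEMO_OPMOD_SHIFT) are assumed values chosen for the example, and byte-order conversion and I/O barriers are left out. The word order in demo_pack_cmd follows the eight doorbell dwords written by the patch in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names mirror the ones the patch adds; values are for this sketch. */
enum {
    DEMO_CMD_USE_EVENTS         = 1 << 0,
    DEMO_CMD_CAN_POST_DOORBELLS = 1 << 1,
    DEMO_CMD_POST_DOORBELLS     = 1 << 2
};

/* Assumed bit layout of the command word, for illustration only. */
enum { DEMO_GO_BIT = 23, DEMO_E_BIT = 22, DEMO_OPMOD_SHIFT = 12 };

/*
 * Doorbell posting is chosen only when the command is event driven, the
 * doorbell path has been set up, and the post_cmd2dbell parameter is
 * non-zero. Everything else falls back to the HCR.
 */
static int demo_use_doorbell(uint32_t flags, int event, int post_cmd2dbell)
{
    return event && (flags & DEMO_CMD_POST_DOORBELLS) != 0 &&
           post_cmd2dbell != 0;
}

/* Split a command into eight 32-bit words (host byte order here). */
static void demo_pack_cmd(uint32_t w[8], uint64_t in_param, uint64_t out_param,
                          uint32_t in_modifier, uint8_t op_modifier,
                          uint16_t op, uint16_t token)
{
    assert(w != NULL);

    w[0] = (uint32_t)(in_param >> 32);
    w[1] = (uint32_t)(in_param & 0xffffffffu);
    w[2] = in_modifier;
    w[3] = (uint32_t)(out_param >> 32);
    w[4] = (uint32_t)(out_param & 0xffffffffu);
    w[5] = (uint32_t)token << 16;
    w[6] = (1u << DEMO_GO_BIT) | (1u << DEMO_E_BIT) |
           ((uint32_t)op_modifier << DEMO_OPMOD_SHIFT) | op;
    w[7] = 0;
}
```

Because the doorbell path is a sequence of posted writes, nothing in it needs to read the go bit back from the device first, which is the saving the description refers to.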
Signed-off-by: Eli Cohen Index: source/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- source.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ source/drivers/infiniband/hw/mthca/mthca_dev.h @@ -117,9 +117,18 @@ enum { MTHCA_OPCODE_INVALID = 0xff }; +enum { + MTHCA_CMD_USE_EVENTS = (1 << 0), + MTHCA_CMD_CAN_POST_DOORBELLS = (1 << 1), + MTHCA_CMD_POST_DOORBELLS = (1 << 2) +}; + +enum { + MTHCA_CMD_NUM_DBELL_DWORDS = 8 +}; + struct mthca_cmd { struct pci_pool *pool; - int use_events; struct mutex hcr_mutex; struct semaphore poll_sem; struct semaphore event_sem; @@ -128,6 +137,10 @@ struct mthca_cmd { int free_head; struct mthca_cmd_context *context; u16 token_mask; + u32 flags; + void __iomem *dbell_map; + u64 dbell_base; + u16 dbell_offsets[MTHCA_CMD_NUM_DBELL_DWORDS]; }; struct mthca_limits { Index: source/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- source.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ source/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -182,25 +182,68 @@ struct mthca_cmd_context { u8 status; }; + +static int post_cmd2dbell = 1; +module_param(post_cmd2dbell, int, 0666); +MODULE_PARM_DESC(post_cmd2dbell, "post commands through doorbell"); + static inline int go_bit(struct mthca_dev *dev) { return readl(dev->hcr + HCR_STATUS_OFFSET) & swab32(1 << HCR_GO_BIT); } -static int mthca_cmd_post(struct mthca_dev *dev, - u64 in_param, - u64 out_param, - u32 in_modifier, - u8 op_modifier, - u16 op, - u16 token, - int event) + +static void mthca_cmd_post_dbell(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token) { - int err = 0; + void __iomem *ptr = dev->cmd.dbell_map; + u16 *offs = dev->cmd.dbell_offsets; - mutex_lock(&dev->cmd.hcr_mutex); + __raw_writel((__force u32) cpu_to_be32(in_param >> 32), + ptr + offs[0]); + wmb(); + __raw_writel((__force u32) 
cpu_to_be32(in_param & 0xfffffffful), + ptr + offs[1]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(in_modifier), + ptr + offs[2]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param >> 32), + ptr + offs[3]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(out_param & 0xfffffffful), + ptr + offs[4]); + wmb(); + __raw_writel((__force u32) cpu_to_be32(token << 16), + ptr + offs[5]); + wmb(); + __raw_writel((__force u32) cpu_to_be32((1 << HCR_GO_BIT) | + (1 << HCA_E_BIT) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), + ptr + offs[6]); + wmb(); + __raw_writel((__force u32) 0, + ptr + offs[7]); + wmb(); +} + +static int mthca_cmd_post_hcr(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ if (event) { unsigned long end = jiffies + GO_BIT_TIMEOUT; @@ -210,10 +253,8 @@ static int mthca_cmd_post(struct mthca_d } } - if (go_bit(dev)) { - err = -EAGAIN; - goto out; - } + if (go_bit(dev)) + return -EAGAIN; /* * We use writel (instead of something like memcpy_toio) @@ -236,7 +277,29 @@ static int mthca_cmd_post(struct mthca_d (op_modifier << HCR_OPMOD_SHIFT) | op), dev->hcr + 6 * 4); -out: + return 0; +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + mutex_lock(&dev->cmd.hcr_mutex); + + if (event && dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && post_cmd2dbell) + mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier, + op_modifier, op, token); + else + err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier, + op_modifier, op, token, event); + mutex_unlock(&dev->cmd.hcr_mutex); return err; } @@ -386,7 +449,7 @@ static int mthca_cmd_box(struct mthca_de unsigned long timeout, u8 *status) { - if (dev->cmd.use_events) + if (dev->cmd.flags & MTHCA_CMD_USE_EVENTS) return mthca_cmd_wait(dev, in_param, &out_param, 0, in_modifier, op_modifier, op, 
timeout, status); @@ -423,7 +486,7 @@ static int mthca_cmd_imm(struct mthca_de unsigned long timeout, u8 *status) { - if (dev->cmd.use_events) + if (dev->cmd.flags & MTHCA_CMD_USE_EVENTS) return mthca_cmd_wait(dev, in_param, out_param, 1, in_modifier, op_modifier, op, timeout, status); @@ -437,7 +500,7 @@ int mthca_cmd_init(struct mthca_dev *dev { mutex_init(&dev->cmd.hcr_mutex); sema_init(&dev->cmd.poll_sem, 1); - dev->cmd.use_events = 0; + dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS; dev->hcr = ioremap(pci_resource_start(dev->pdev, 0) + MTHCA_HCR_BASE, MTHCA_HCR_SIZE); @@ -461,6 +524,8 @@ void mthca_cmd_cleanup(struct mthca_dev { pci_pool_destroy(dev->cmd.pool); iounmap(dev->hcr); + if (dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS) + iounmap(dev->cmd.dbell_map); } /* @@ -498,12 +563,47 @@ int mthca_cmd_use_events(struct mthca_de ; /* nothing */ --dev->cmd.token_mask; - dev->cmd.use_events = 1; + dev->cmd.flags |= MTHCA_CMD_USE_EVENTS; + + down(&dev->cmd.poll_sem); return 0; } + +/* + * attempt to post commands through doorbells + */ +int mthca_use_cmd_doorbells(struct mthca_dev *dev) +{ + int i; + u16 max_off = 0; + unsigned long pg1, pg2; + + if (!dev->cmd.flags & MTHCA_CMD_CAN_POST_DOORBELLS) + return -ENODEV; + + for (i=0; i<8; ++i) + if (dev->cmd.dbell_offsets[i] > max_off) + max_off = dev->cmd.dbell_offsets[i]; + + pg1 = dev->cmd.dbell_base & PAGE_MASK; + pg2 = (dev->cmd.dbell_base + max_off) & PAGE_MASK; + + if (pg1 != pg2) + return -ENOMEM; + + dev->cmd.dbell_map = ioremap(dev->cmd.dbell_base, max_off + sizeof(u32)); + if (!dev->cmd.dbell_map) + return -ENOMEM; + + dev->cmd.flags |= MTHCA_CMD_POST_DOORBELLS; + mthca_dbg(dev, "posting commands through doorbell\n"); + + return 0; +} + /* * Switch back to polling (used when shutting down the device) */ @@ -511,7 +611,7 @@ void mthca_cmd_use_polling(struct mthca_ { int i; - dev->cmd.use_events = 0; + dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS; for (i = 0; i < dev->cmd.max_cmds; ++i) down(&dev->cmd.event_sem); @@ -665,8 
+765,10 @@ int mthca_QUERY_FW(struct mthca_dev *dev { struct mthca_mailbox *mailbox; u32 *outbox; + u32 tmp; int err = 0; u8 lg; + int i; #define QUERY_FW_OUT_SIZE 0x100 #define QUERY_FW_VER_OFFSET 0x00 @@ -674,6 +776,11 @@ int mthca_QUERY_FW(struct mthca_dev *dev #define QUERY_FW_ERR_START_OFFSET 0x30 #define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_CMD_DB_EN_OFFSET 0x10 +#define QUERY_FW_CMD_DB_OFFSET 0x50 +#define QUERY_FW_CMD_DB_BASE 0x60 + #define QUERY_FW_START_OFFSET 0x20 #define QUERY_FW_END_OFFSET 0x28 @@ -706,6 +813,15 @@ int mthca_QUERY_FW(struct mthca_dev *dev dev->cmd.max_cmds = 1 << lg; MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET); MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET); + MTHCA_GET(tmp, outbox, QUERY_FW_CMD_DB_EN_OFFSET); + if (tmp & 0x1) { + mthca_dbg(dev, "FW supports commands through doorbells\n"); + dev->cmd.flags |= MTHCA_CMD_CAN_POST_DOORBELLS; + } + MTHCA_GET(dev->cmd.dbell_base, outbox, QUERY_FW_CMD_DB_BASE); + for (i=0; icmd.dbell_offsets[i], outbox, + QUERY_FW_CMD_DB_OFFSET + (i << 1)); mthca_dbg(dev, "FW version %012llx, max commands %d\n", (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); @@ -1659,7 +1775,6 @@ int mthca_MODIFY_QP(struct mthca_dev *de if (0) { int i; mthca_dbg(dev, "Dumping QP context:\n"); - printk(" opt param mask: %08x\n", be32_to_cpup(mailbox->buf)); for (i = 0; i < 0x100 / 4; ++i) { if (i % 8 == 0) printk(" [%02x] ", i * 4); Index: source/drivers/infiniband/hw/mthca/mthca_cmd.h =================================================================== --- source.orig/drivers/infiniband/hw/mthca/mthca_cmd.h +++ source/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -244,6 +244,7 @@ struct mthca_set_ib_param { int mthca_cmd_init(struct mthca_dev *dev); void mthca_cmd_cleanup(struct mthca_dev *dev); int mthca_cmd_use_events(struct mthca_dev *dev); +int mthca_use_cmd_doorbells(struct mthca_dev *dev); void mthca_cmd_use_polling(struct mthca_dev *dev); void 
mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); Index: source/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- source.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ source/drivers/infiniband/hw/mthca/mthca_main.c @@ -69,6 +69,8 @@ MODULE_PARM_DESC(msi, "attempt to use MS #endif /* CONFIG_PCI_MSI */ + + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -754,6 +756,11 @@ static int __devinit mthca_setup_hca(str goto err_eq_table_free; } + err = mthca_use_cmd_doorbells(dev); + if (err) + mthca_dbg(dev, "not using commands through doorbells\n"); + + err = mthca_NOP(dev, &status); if (err || status) { mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n", 
From ogerlitz at voltaire.com Mon Feb 20 06:54:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 20 Feb 2006 16:54:14 +0200 Subject: [openib-general] Re: [PATCH] mutex-backport: add mutex_trylock support In-Reply-To: <20060219133915.GG22037@mellanox.co.il> References: <20060219124553.GA22037@mellanox.co.il> <43F86ABA.5060209@voltaire.com> <20060219130637.GC22037@mellanox.co.il> <43F872A2.3090400@voltaire.com> <20060219133915.GG22037@mellanox.co.il> Message-ID: <43F9D816.30005@voltaire.com> Michael S. Tsirkin wrote: > Please copy verbatim then: > > * NOTE: this function follows the spin_trylock() convention, so > * it is negated to the down_trylock() return values! Be careful > * about this when converting semaphore users to mutexes. > > Both NOTE and the last line are missing. let it be, feel free to change it to whatever you think need to be written, note that anyway this is temporal file to be used only before 2.6.16 is out Or. From anton at samba.org Mon Feb 20 06:59:05 2006 From: anton at samba.org (Anton Blanchard) Date: Tue, 21 Feb 2006 01:59:05 +1100 Subject: [openib-general] Re: [PATCH 01/22] Add powerpc-specific clear_cacheline(), which just compiles to "dcbz". In-Reply-To: <20060218005704.13620.88286.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005704.13620.88286.stgit@localhost.localdomain> Message-ID: <20060220145904.GA19895@krispykreme> Hi, > This is horribly non-portable. How much of a performance difference > does it make? How does it do on ppc64 systems where the cacheline > size is not 32? Yes, if anything we should catch cacheline aligned, multiple cacheline sized zeroing in memset. 
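One way to picture the suggestion above is a zeroing routine that checks for the favourable case itself, so callers never need a special cacheline primitive. The following is a portable sketch under stated assumptions, not the powerpc memset: DEMO_CACHELINE is an assumed constant (the real size varies by CPU, which is exactly the portability concern in the quoted question), and demo_zero_line is a hypothetical stand-in for a line-clearing instruction such as dcbz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed line size for the sketch; real code must ask the platform. */
#define DEMO_CACHELINE 128

/* Stand-in for a single "zero one cache line" instruction. */
static void demo_zero_line(void *line)
{
    memset(line, 0, DEMO_CACHELINE);
}

/*
 * Zero len bytes at buf. Takes the line-at-a-time path only when buf is
 * cacheline aligned and len is a non-zero multiple of the line size;
 * otherwise falls back to plain memset. Returns 1 if the fast path ran.
 */
static int demo_zero(void *buf, size_t len)
{
    assert(buf != NULL);

    if ((uintptr_t)buf % DEMO_CACHELINE == 0 &&
        len != 0 && len % DEMO_CACHELINE == 0) {
        unsigned char *p = buf;
        size_t off;

        for (off = 0; off < len; off += DEMO_CACHELINE)
            demo_zero_line(p + off);
        return 1;
    }

    memset(buf, 0, len);
    return 0;
}
```

The design point is that the alignment and size test lives in one place, so a driver can call the generic routine and still get the fast path when its buffers happen to qualify.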
Anton From RAISCH at de.ibm.com Mon Feb 20 07:06:19 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Mon, 20 Feb 2006 16:06:19 +0100 Subject: [openib-general] Re: [PATCH 00/22] [RFC] IBM eHCA InfiniBand adapter driver In-Reply-To: <20060218005532.13620.79663.stgit@localhost.localdomain> Message-ID: Roland, as you already stated, we really have a problem in that we're not able to send "large" pieces of code to the kernel mailing list. It's perfectly OK for us to send patches to the openib.org mailing list and svn. This is something we are still trying to resolve with legal. So thank you, Roland, for acting as a proxy here... We have the OK to contribute to any ehca-related discussion on the kernel mailing list and the ppc64 mailing list, and are absolutely willing to do so! Adding a new driver for complex new hardware isn't the regular Linux development case, especially if there's no base code in the Linux kernel to patch against... In our case this patch resulted in 22 postings. Some people already noticed that there's still quite some road ahead of us... but we're absolutely willing to work on that, and we had to start somewhere. Some comments will result in modifications to all files. I guess posting 22 new patch files (diff against NIL) each week is sort of a DoS attack on the mailing list and we'll end up in people's spam folders pretty quickly... So what's the recommended way to proceed here? Gruss / Regards . . . Christoph Raisch christoph raisch, HCAD teamlead Roland Dreier wrote on 18.02.2006 01:55:32: > Here's a series of patches that add an InfiniBand adapter driver > for IBM eHCA hardware. Please look it over with an eye towards issues > that need to be addressed before merging this upstream.
> From anton at samba.org Mon Feb 20 07:09:53 2006 From: anton at samba.org (Anton Blanchard) Date: Tue, 21 Feb 2006 02:09:53 +1100 Subject: [openib-general] Re: [PATCH 03/22] pHype specific stuff In-Reply-To: <20060218005709.13620.77409.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005709.13620.77409.stgit@localhost.localdomain> Message-ID: <20060220150953.GB19895@krispykreme> Hi, > +inline static u32 getLongBusyTimeSecs(int longBusyRetCode) > +{ > + switch (longBusyRetCode) { > + case H_LongBusyOrder1msec: > + return 1; > + case H_LongBusyOrder10msec: > + return 10; > + case H_LongBusyOrder100msec: > + return 100; > + case H_LongBusyOrder1sec: > + return 1000; > + case H_LongBusyOrder10sec: > + return 10000; > + case H_LongBusyOrder100sec: > + return 100000; > + default: > + return 1; > + } /* eof switch */ > +} Since this actually returns milliseconds it might be worth making it obvious in the function name. Also no need to use studly caps for the function name and variable. We will fix the studly caps H_LongBusy* stuff another day :) > +inline static long plpar_hcall_7arg_7ret(unsigned long opcode, > +inline static long plpar_hcall_9arg_9ret(unsigned long opcode, These belong in arch/powerpc/platforms/pseries/hvCall.S Anton From anton at samba.org Mon Feb 20 07:12:15 2006 From: anton at samba.org (Anton Blanchard) Date: Tue, 21 Feb 2006 02:12:15 +1100 Subject: [openib-general] Re: [PATCH 07/22] Hypercall definitions In-Reply-To: <20060218005721.13620.84990.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005721.13620.84990.stgit@localhost.localdomain> Message-ID: <20060220151215.GC19895@krispykreme> Hi, > Do these defines belong in the ehca driver, or should they be put > somewhere in generic hypercall support? 
Agreed, I think they should go into include/asm-powerpc/hvcall.h Anton From anton at samba.org Mon Feb 20 07:22:13 2006 From: anton at samba.org (Anton Blanchard) Date: Tue, 21 Feb 2006 02:22:13 +1100 Subject: [openib-general] Re: [PATCH 21/22] ehca main file In-Reply-To: <20060218005759.13620.10968.stgit@localhost.localdomain> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005759.13620.10968.stgit@localhost.localdomain> Message-ID: <20060220152213.GD19895@krispykreme> Hi, > What is ehca_show_flightrecorder() trying to do that snprintf() is > not fast enough? If you need to pass a binary structure back to > userspace (with a kernel address in it??) then sysfs is not the right > place to put it. Look at debugfs; or relayfs might make the most > sense for your flightrecorder stuff. I agree debugfs or relayfs would be better suited. Of course as the driver matures this form of debug is probably not required at all. > +#include "hcp_sense.h" /* TODO: later via hipz_* header file */ > +#include "hcp_if.h" /* TODO: later via hipz_* header file */ I count 88 TODOs in the driver, it would be nice to get rid of some of them like the two above, so we can concentrate on the important TODOs :) > +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,12) > +#define EHCA_RESOURCE_ATTR_H(name) \ > +static ssize_t ehca_show_##name(struct device *dev, \ > + struct device_attribute *attr, \ > + char *buf) > +#else > +#define EHCA_RESOURCE_ATTR_H(name) \ > +static ssize_t ehca_show_##name(struct device *dev, \ > + char *buf) > +#endif No need for kernel version ifdefs. 
Anton From halr at voltaire.com Mon Feb 20 07:27:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Feb 2006 10:27:31 -0500 Subject: [openib-general] Re: Re:[PATCH] OpenSM/st.c: Fix some size_t issues related to memoryallocation in st.c In-Reply-To: <5zpsljk03k.fsf@mtl066.yok.mtl.com> References: <5zpsljk03k.fsf@mtl066.yok.mtl.com> Message-ID: <1140449250.4476.36129.camel@hal.voltaire.com> Hi Yael, On Sun, 2006-02-19 at 07:24, Yael Kalka wrote: > Hi Hal, > > The patch in general is fine. I've added one change to the original > patch - to avoid casting issues under Windows > Below is the full patch. This looks like the same patch to me. What changed? (What did I miss?) -- Hal > > Yael > > Signed-off-by: Hal Rosenstock > > Index: include/opensm/st.h =================================================================== > --- include/opensm/st.h (revision 5436) > +++ include/opensm/st.h (working copy) > @@ -40,6 +40,8 @@ > #ifndef ST_INCLUDED > #define ST_INCLUDED > > +#include > + > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > # define END_C_DECLS } > @@ -79,11 +81,11 @@ struct st_table { > enum st_retval {ST_CONTINUE, ST_STOP, ST_DELETE}; > > st_table *st_init_table(struct st_hash_type *); > -st_table *st_init_table_with_size(struct st_hash_type *, int); > +st_table *st_init_table_with_size(struct st_hash_type *, size_t); > st_table *st_init_numtable(void); > -st_table *st_init_numtable_with_size(int); > +st_table *st_init_numtable_with_size(size_t); > st_table *st_init_strtable(void); > -st_table *st_init_strtable_with_size(int); > +st_table *st_init_strtable_with_size(size_t); > int st_delete(st_table *, st_data_t *, st_data_t *); > int st_delete_safe(st_table *, st_data_t *, st_data_t *, st_data_t); int st_insert(st_table *, st_data_t, st_data_t); > > Index: opensm/st.c > =================================================================== > --- opensm/st.c (revision 5436) > +++ opensm/st.c (working copy) > @@ -42,7 +42,6 @@ > #endif /* 
HAVE_CONFIG_H */ > > #include > -#include > #include > #include > > @@ -102,17 +101,11 @@ static struct st_hash_type type_strhash > #define xcalloc calloc > #define xrealloc realloc > #define xfree free > -#if 0 > -void *xmalloc(long); > -void *xcalloc(long, long); > -void *xrealloc(void *, long); > -void xfree(void *); > -#endif > > static void rehash(st_table *); > > -#define alloc(type) (type*)xmalloc((unsigned)sizeof(type)) > -#define Calloc(n,s) (char*)xcalloc((n),(s)) > +#define alloc(type) (type*)xmalloc(sizeof(type)) > +#define Calloc(n,s) (char*)xcalloc((n), (s)) > > #define EQUAL(table,x,y) ((x)==(y) || (*table->type->compare)(((void*)x),((void *)y)) == 0) > > @@ -200,7 +193,7 @@ stat_col() > st_table* > st_init_table_with_size(type, size) > struct st_hash_type *type; > - int size; > + size_t size; > { > st_table *tbl; > > @@ -238,7 +231,7 @@ st_init_numtable(void) > > st_table* > st_init_numtable_with_size(size) > - int size; > + size_t size; > { > return st_init_table_with_size(&type_numhash, size); > } > @@ -251,7 +244,7 @@ st_init_strtable(void) > > st_table* > st_init_strtable_with_size(size) > - int size; > + size_t size; > { > return st_init_table_with_size(&type_strhash, size); > } > @@ -314,7 +307,8 @@ st_lookup(table, key, value) > return 0; > } > else { > - if (value != 0) *value = ptr->record; > + if (value != 0) > + *value = ptr->record; > return 1; > } > } > @@ -407,7 +401,8 @@ st_copy(old_table) > { > st_table *new_table; > st_table_entry *ptr, *entry; > - int i, num_bins = old_table->num_bins; > + int i; > + size_t num_bins = old_table->num_bins; > > new_table = alloc(st_table); > if (new_table == 0) > @@ -417,7 +412,7 @@ st_copy(old_table) > > *new_table = *old_table; > new_table->bins = (st_table_entry**) > - Calloc((unsigned)num_bins, sizeof(st_table_entry*)); > + Calloc(num_bins, sizeof(st_table_entry*)); > > if (new_table->bins == 0) > { > @@ -524,7 +519,7 @@ st_delete_safe(table, key, value, never) > } > > static int > -delete_never( 
st_data_t key, st_data_t value, st_data_t never) > +delete_never(st_data_t key, st_data_t value, st_data_t never) > { > if (value == never) return ST_DELETE; > return ST_CONTINUE; > > > From schihei at de.ibm.com Mon Feb 20 18:09:59 2006 From: schihei at de.ibm.com (Heiko J Schick) Date: Tue, 21 Feb 2006 03:09:59 +0100 Subject: [openib-general] Re: [PATCH 21/22] ehca main file In-Reply-To: <20060220152213.GD19895@krispykreme> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005759.13620.10968.stgit@localhost.localdomain> <20060220152213.GD19895@krispykreme> Message-ID: <43FA7677.3040901@de.ibm.com> Hello Anton, thanks for your help! >>+#include "hcp_sense.h" /* TODO: later via hipz_* header file */ >>+#include "hcp_if.h" /* TODO: later via hipz_* header file */ > > > I count 88 TODOs in the driver, it would be nice to get rid of some of > them like the two above, so we can concentrate on the important TODOs :) We will remove the TODOs as soon as possible. >>+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,12) >>+#define EHCA_RESOURCE_ATTR_H(name) \ >>+static ssize_t ehca_show_##name(struct device *dev, \ >>+ struct device_attribute *attr, \ >>+ char *buf) >>+#else >>+#define EHCA_RESOURCE_ATTR_H(name) \ >>+static ssize_t ehca_show_##name(struct device *dev, \ >>+ char *buf) >>+#endif > > > No need for kernel version ifdefs. The point is that our module has to run on Linux 2.6.5-7.244 (SuSE SLES 9 SP3), too. This was the reason why we've included the ifdefs. We can change the ifdefs to #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,5) to mark that this code is used for Linux 2.6.5 compatibility.
Regards, Heiko From rdreier at cisco.com Mon Feb 20 08:48:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 08:48:56 -0800 Subject: [openib-general] Re: ibv_create_srq doesn't update the SRQ init attributes In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4E72@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 20 Feb 2006 10:47:10 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4E72@mtlexch01.mtl.com> Message-ID: Dotan> No problem, but I don't know what to change ... Can you Dotan> explain to me the concept of the ABI versioning ? I don't think either the ABI or API has to change. Right now we have struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr); so the library should just update srq_init_attr with the values actually used to create the SRQ. The parameter isn't declared const or anything, so this should be fine for the 1.0 release. - R. From rdreier at cisco.com Mon Feb 20 08:50:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 08:50:30 -0800 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060220090323.GA24524@minantech.com> (Gleb Natapov's message of "Mon, 20 Feb 2006 11:03:23 +0200") References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> Message-ID: Gleb> It had come to my attention that there is file memory.c in Gleb> libibverbs that implements refcounting for mlock. I think it Gleb> was meant to be used from reg_mr() (since interface is Gleb> hidden) back when mlock was needed for kernel bug Gleb> workaround. If it was good idea back than why not now? Gleb> Alternatively we can make this interface public for Gleb> application to use explicitly. 
Gleb> Roland can you please tell the history behind memory.c? This was used for mlock() when that was required. The reason I left it around was that I always thought that the MADV_DONTFORK would be handled the same way. Probably the best thing to do would be to have libibverbs handle the madvise() stuff transparently by default, but perhaps add some hooks for more intelligent applications or MPI libraries to do their own thing. - R. From rdreier at cisco.com Mon Feb 20 08:52:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 08:52:55 -0800 Subject: [openib-general] Re: [PATCH 21/22] ehca main file In-Reply-To: <20060220152213.GD19895@krispykreme> (Anton Blanchard's message of "Tue, 21 Feb 2006 02:22:13 +1100") References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060218005759.13620.10968.stgit@localhost.localdomain> <20060220152213.GD19895@krispykreme> Message-ID: Anton> No need for kernel version ifdefs. Sorry, I tried to strip these out before posting the patch, but I missed one. Anyway, totally agree on the ifdefs and I will be double-extra-sure that the final version doesn't include them. - R. From rdreier at cisco.com Mon Feb 20 08:55:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 08:55:24 -0800 Subject: [openib-general] Re: [PATCH 00/22] [RFC] IBM eHCA InfiniBand adapter driver In-Reply-To: (Christoph Raisch's message of "Mon, 20 Feb 2006 16:06:19 +0100") References: Message-ID: Christoph> I guess posting 22 new patch files (diff against NIL) Christoph> each week is sort of a DoS attack on the mailing list Christoph> and we'll end up in peoples spam folders pretty Christoph> quickly... So what's the recomended way to proceed Christoph> here? I don't think there's any other way to proceed. For each version, you should carefully note down the feedback that you received and how you are responding to each suggestion, and include that with the patch file. 
But it's too much to expect for people to keep context for a patch under review, so even though it generates a lot of email, I think that including the whole series is the only way to go. Perhaps the list admins disagree with me though ;) - R. From rdreier at cisco.com Mon Feb 20 08:56:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 08:56:28 -0800 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <7.0.1.0.2.20060219174756.045d6290@netapp.com> (Thomas Talpey's message of "Sun, 19 Feb 2006 17:56:41 -0500") References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> <7.0.1.0.2.20060219174756.045d6290@netapp.com> Message-ID: Thomas> And, this is only one of many memory registration Thomas> modes. We would use memory windows, if only OpenIB Thomas> provided them (yes I know the hardware currently sucks for Thomas> them). We will add FMR support shortly. In both these Thomas> modes we perform all addressing by the book via 1-1 OpenIB Thomas> registration. Memory windows don't solve anything here, do they? You still have to register the full region using bus addresses somehow. - R. From spoole at lanl.gov Mon Feb 20 09:43:51 2006 From: spoole at lanl.gov (Stephen Poole) Date: Mon, 20 Feb 2006 10:43:51 -0700 Subject: [openib-general] Re: [PATCH 00/22] [RFC] IBM eHCA InfiniBand adapter driver In-Reply-To: References: Message-ID: If every open source company was being sued for $3B I think many companies would be a bit timid. :-) IBM has been working this issue at all levels. It will happen when IBM Legal has figured out all of the necessary paths in order to cover any potential law suits. Unfortunately, the open source path has been muddied by some folks. Steve... 
At 4:06 PM +0100 2/20/06, Christoph Raisch wrote: >Roland, >as you already stated we really have a problem that we're not able to send >"large" pieces of code to the kernel mailing list. >It's perfectly ok for us to send patches to the openib.org mailing list and >svn. >This is something we still try to resolve with legal. >So thank you Roland for acting as a proxy here... >We have the ok to contribute to any ehca related discussion on kernel >mailing-list and ppc64-mailing list, and are absolutely willing to do so! > >Adding a new driver for a complex new hardware isn't the regular linux >develpment case, especially if there's no base code in linux kernel to >patch against... >In our case this patch resulted in 22 postings. >Some people already noticed that there's still quite some road ahead of >us... but we're abolutely willing to work that, and we had to start at some >place. >Some coments will result in modifications to all files. >I guess posting 22 new patch files (diff against NIL) each week is sort of >a DoS attack on the mailing list and we'll end up in peoples spam folders >pretty quickly... >So what's the recomended way to proceed here? > > >Gruss / Regards . . . Christoph Raisch > >christoph raisch, HCAD teamlead > >Roland Dreier wrote on 18.02.2006 01:55:32: > >> Here's a series of patches that add an InfiniBand adapter driver >> for IBM eHCA hardware. Please look it over with an eye towards issues >> that need to be addressed before merging this upstream. >> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Steve Poole (spoole at lanl.gov) Office: 505.665.9662 Los Alamos National Laboratory Cell: 505.699.3807 CCN - Special Projects / Advanced Development Fax: 505.665.7793 P.O. Box 1663, MS B255 Los Alamos, NM. 
87545 03149801S From mst at mellanox.co.il Mon Feb 20 09:54:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Feb 2006 19:54:35 +0200 Subject: [openib-general] Re: Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> Message-ID: <20060220175435.GK20285@mellanox.co.il> Quoting Roland Dreier : > Gleb> Roland can you please tell the history behind memory.c? > > This was used for mlock() when that was required. The reason I left > it around was that I always thought that the MADV_DONTFORK would be > handled the same way. > > Probably the best thing to do would be to have libibverbs handle the > madvise() stuff transparently by default, but perhaps add some hooks > for more intelligent applications or MPI libraries to do their own thing. How would the hook look like? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlin.bestler at gmail.com Mon Feb 20 10:12:32 2006 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 20 Feb 2006 10:12:32 -0800 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> <7.0.1.0.2.20060219174756.045d6290@netapp.com> Message-ID: <469958e00602201012h492f5aa0h494274b67bcaa1d2@mail.gmail.com> On 2/20/06, Roland Dreier wrote: > > Thomas> And, this is only one of many memory registration > Thomas> modes. We would use memory windows, if only OpenIB > Thomas> provided them (yes I know the hardware currently sucks for > Thomas> them). We will add FMR support shortly. 
In both these > Thomas> modes we perform all addressing by the book via 1-1 OpenIB > Thomas> registration. > > Memory windows don't solve anything here, do they? You still have to > register the full region using bus addresses somehow. > > - R. > _______________________________________________ Keep in mind that there are two problems: registering memory and exposing memory. Windows solve the latter problem. FMR work requests solve both. A kernel-based, storage-related client will frequently want to form logical buffers from scattered physical pages. The pages selected are not necessarily part of an existing virtual memory map, and especially not a registered one. You can solve the memory registration problem once by creating an exportable memory region that covers all of physical memory. The problem is that you do not want to advertise that region's RKey/STag. Memory Windows solve that problem by allowing you to bind windows within the memory region. The problem is that if the buffer is not physically contiguous then you still have to export a multi-element list in order to have the peer read from a discontiguous target. A Fast Memory Region allows arbitrary sets of pages to form a single logical window for the purposes of peer-to-peer interaction, and has a life cycle that more naturally maps the duration when the pages have to be iomapped. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arnd at arndb.de Mon Feb 20 10:32:31 2006 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 20 Feb 2006 19:32:31 +0100 Subject: [openib-general] Re: [PATCH 21/22] ehca main file In-Reply-To: <43FA7677.3040901@de.ibm.com> References: <20060218005532.13620.79663.stgit@localhost.localdomain> <20060220152213.GD19895@krispykreme> <43FA7677.3040901@de.ibm.com> Message-ID: <200602201932.31739.arnd@arndb.de> On Tuesday 21 February 2006 03:09, Heiko J Schick wrote: >  >>+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,12) >  >>+#define EHCA_RESOURCE_ATTR_H(name)                                         \ >  >>+static ssize_t  ehca_show_##name(struct device *dev,                       \ >  >>+                             struct device_attribute *attr,            \ >  >>+                             char *buf) >  >>+#else >  >>+#define EHCA_RESOURCE_ATTR_H(name)                                         \ >  >>+static ssize_t  ehca_show_##name(struct device *dev,                       \ >  >>+                             char *buf) >  >>+#endif >  > >  > >  > No need for kernel version ifdefs. > > The point is that our module have to run on Linux 2.6.5-7.244 (SuSE SLES 9 SP3), too. > This was the reason why we've included the ifdefs. We can change the ifdefs to > #if LINUX_VERSION_CODE >= KERNEL_VERSION(2.6.5) to mark that this code is used for > Linux 2.6.5 compatibility. That only makes sense as long as you have a common source code for both that also is under your control. As soon as the driver enters the mainline kernel, it is no longer helpful to have these checks in it, because other people will start making changes to the driver that you don't want to have in the 2.6.5 version. You cannot avoid forking the code in the long term, but fortunately the need to backport fixes to the old version should also decrease over time. Arnd <>< From mst at mellanox.co.il Mon Feb 20 11:58:45 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 20 Feb 2006 21:58:45 +0200 Subject: [openib-general] ipoib patches Message-ID: <20060220195845.GG22781@mellanox.co.il> Hi, Roland! What's going on with ipoib patches in contrib/mellanox? There are still 9 patches outstanding, most of them are really simple and should be safe bet even for 2.6.16. There's also mthca_cosmetic_icm_page_size.patch there which looks like a safe one. Other patches might be good candidates at least for svn tree so that they get some coverage there. Should I repost so we get some discussion going? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From fkruggel at uci.edu Mon Feb 20 12:42:09 2006 From: fkruggel at uci.edu (Frithjof Kruggel) Date: Mon, 20 Feb 2006 12:42:09 -0800 Subject: [openib-general] Performance optimization Message-ID: <43FA29A1.5020301@uci.edu> Hi, I am a newbie to this list. I just set up a small server system of 10x2 opteron nodes equipped with Mellanox cards and a Mellanox switch. Installed the gen2 trunk "out of the box" on a Debian system with a Linux 2.6.15.4 kernel. 
However, the performance as measured by the read/ write tests is "only" about 200 MB/s - I expected figures in the order of 400-500 MB/s. Can you give me some step-by-step list how to optimize performance? I walked through the digests but was unable to find any hint. Thanks, Frithjof Kruggel From mst at mellanox.co.il Mon Feb 20 12:49:23 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Feb 2006 22:49:23 +0200 Subject: [openib-general] Re: Performance optimization In-Reply-To: <43FA29A1.5020301@uci.edu> References: <43FA29A1.5020301@uci.edu> Message-ID: <20060220204923.GI22781@mellanox.co.il> Quoting r. Frithjof Kruggel : > However, the performance as measured by the read/ > write tests is "only" about 200 MB/s. What do you refer to as read/write tests? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From tom at opengridcomputing.com Mon Feb 20 13:00:23 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 20 Feb 2006 15:00:23 -0600 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 In-Reply-To: <469958e00602201012h492f5aa0h494274b67bcaa1d2@mail.gmail.com> References: <7.0.1.0.2.20060207211754.0409e8a0@netapp.com> <20060219120155.GA10268@lst.de> <7.0.1.0.2.20060219093506.040ee028@netapp.com> <7.0.1.0.2.20060219174756.045d6290@netapp.com> <469958e00602201012h492f5aa0h494274b67bcaa1d2@mail.gmail.com> Message-ID: <1140469223.15835.13.camel@trinity.ogc.int> Are we all talking about the same thing here? I think Christoph is just asking that the code use dma_map_[single|page|sg] instead of using page_to_phys or virt_to_phys. The rest of this talk is about memory registration strategies which is a different issue than having the code assume that bus address == phys addresses. > On Mon, 2006-02-20 at 10:12 -0800, Caitlin Bestler wrote: > > > On 2/20/06, Roland Dreier wrote: > Thomas> And, this is only one of many memory registration > Thomas> modes. 
We would use memory windows, if only OpenIB > Thomas> provided them (yes I know the hardware currently > sucks for > Thomas> them). We will add FMR support shortly. In both > these > Thomas> modes we perform all addressing by the book via > 1-1 OpenIB > Thomas> registration. > > Memory windows don't solve anything here, do they? You still > have to > register the full region using bus addresses somehow. > > - R. > _______________________________________________ > Keep in mind that there are two problems: registering memory and > exposing memory. > Windows solves the latter problem. FMR work requests solves both. > > A kernel based storage related client will frequently want to form > logical buffers > from scattered physical pages. The pages selected are not necessarily > part > of an existing virtual memory map, and especially not a registered > one. > > You can solve the memory registration problem once by creating an > exportable memory region that covers all of physical memory. The > problem is that you do not want to advertise that regions RKey/STag. > > Memory Windows solve that problem, by allowing you to bind windows > within the memory region. The problem is that if the buffer is not > physically > continquous then you still have to export a multi-element list in > order to > have the peer to read from a discontiguous target. > > A Fast Memory Region allows arbitrary sets of pages to form a single > logical window for the purposes of peer-to-peer interaction, and has a > life cycle that more naturally maps the duration when the pages have > to be iomapped. 
> > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mdidomenico at silverstorm.com Mon Feb 20 13:06:25 2006 From: mdidomenico at silverstorm.com (DiDomenico, Mike) Date: Mon, 20 Feb 2006 16:06:25 -0500 Subject: [openib-general] SRP not detecting Message-ID: I'm trying to get an OpenIB system connected via SRP Initiator to an SRP Target on a SilverStorm 7000 Vfx switch. When I try to detect the SRP initiator through the switch it doesn't show up. Is there anything else that I need to do to get the switch to see the initiator besides modprobe ib_srp? You can see what's happening from the output below, which I believe is being caused by there being no map in the 7000. I haven't tried adding it manually yet, but I'm curious why it doesn't auto-detect... thanks [root at linux14 bin]# ./dmcli -t /sys/class/infiniband/mthca0/ports/1 -d /dev/infiniband/umad0 3 IO Unit Info: max controllers: 2 controller[ 1] GUID: 00066a013800017a vendor ID: 066a00 device ID: 000038 ID: Chassis 0x00066A0050000101, Slot 1, IOC 1 service entries: 1 service[ 0]: 0000494353535250 / SRP.T10:0000000000000001 controller[ 2] GUID: 00066a023800017a vendor ID: 066a00 device ID: 000038 ID: Chassis 0x00066A0050000101, Slot 1, IOC 2 service entries: 1 service[ 0]: 0000494353535250 / SRP.T10:0000000000000001 [root at linux14 bin]# echo id_ext=0000000000000001,ioc_guid=00066a013800017a,dgid=fe8000000000000000066a013800017a,pkey=ffff,service_id=0000494353535250 > /sys/class/infiniband_srp/srp-mthca0-1/add_target [root at linux14 bin]# tail /var/log/messages Feb 20 16:05:39 linux14 kernel: ib_srp: Got failed path rec status -22 Feb 20 16:05:39 linux14 kernel: ib_srp: Path record query failed Feb 20 16:05:39 linux14 kernel: ib_srp: Connection failed -------------- next part -------------- An HTML 
attachment was scrubbed... URL: From rdreier at cisco.com Mon Feb 20 13:35:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 13:35:51 -0800 Subject: [openib-general] SRP not detecting In-Reply-To: (Mike DiDomenico's message of "Mon, 20 Feb 2006 16:06:25 -0500") References: Message-ID: Mike> I'm trying to get an OpenIB system connected via SRP Mike> Initiator to an SRP Target on a SilverStorm 7000 Vfx switch. Mike> When I try to detect the SRP initiator through the switch it Mike> doesn't show up. Is there anything else that I need to do Mike> to get the switch to see the initiator besides modprobe Mike> ib_srp? Looks like you are using the wrong port GUID, and so the path lookup to the target fails. Mike> [root at linux14 bin]# ./dmcli -t /sys/class/infiniband/mthca0/ports/1 -d /dev/infiniband/umad0 3 I suggest using ibsrpdm from my srptools stuff under userspace in svn. It's a lot easier to use, since you don't have to guess the LID of the target, and it tells you everything you need to do. You can just do "ibsrpdm" to get a summary of all the targets on the fabric, and "ibsrpdm -c" to get a string you can echo to the kernel to connect. Mike> controller[ 1] Mike> GUID: 00066a013800017a This is almost definitely the node GUID, so using it as the port GUID will be wrong. - R. From rdreier at cisco.com Mon Feb 20 13:48:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 13:48:24 -0800 Subject: [openib-general] Re: ipoib patches In-Reply-To: <20060220195845.GG22781@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Feb 2006 21:58:45 +0200") References: <20060220195845.GG22781@mellanox.co.il> Message-ID: Michael> Hi, Roland! What's going on with ipoib patches in Michael> contrib/mellanox? There are still 9 patches outstanding, Michael> most of them are really simple and should be safe bet Michael> even for 2.6.16. Sorry, I still need to get back to that stuff. 
The most critical one I think is the neighbour destructor change. I will ask netdev again for status on that. The rest either don't look urgent, or I need to take time and understand what's going on (especially for the ones coming from code review rather than actual crashes). Michael> There's also mthca_cosmetic_icm_page_size.patch there Michael> which looks like a safe one. It doesn't look urgent. I wanted to take some time to make sure that it wasn't converting some 4096's that shouldn't be lumped together with the ICM page size. Michael> Other patches might be good candidates at least for svn Michael> tree so that they get some coverage there. Good point. - R. From rdreier at cisco.com Mon Feb 20 13:48:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 13:48:59 -0800 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060220175435.GK20285@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Feb 2006 19:54:35 +0200") References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> <20060220175435.GK20285@mellanox.co.il> Message-ID: Michael> How would the hook look like? I don't know. Perhaps a callback when the library would have called madvise()? I haven't really figured it out yet... - R. From mst at mellanox.co.il Mon Feb 20 15:19:04 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 21 Feb 2006 01:19:04 +0200 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> <20060220175435.GK20285@mellanox.co.il> Message-ID: <20060220231904.GB23774@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: madvise MADV_DONTFORK/MADV_DOFORK > > Michael> How would the hook look like? > > I don't know. Perhaps a callback when the library would have called > madvise()? I haven't really figured it out yet... Note that anything like that affects API. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 20 15:28:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 15:28:11 -0800 Subject: [openib-general] Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: <20060220231904.GB23774@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 21 Feb 2006 01:19:04 +0200") References: <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> <20060220175435.GK20285@mellanox.co.il> <20060220231904.GB23774@mellanox.co.il> Message-ID: Michael> Note that anything like that affects API. Yes. But it could be done by simply adding a new entry point to the application's interface, so the disruption is minimal. - R. 
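For readers following the madvise thread, here is a minimal sketch of what the handling around memory registration could look like, assuming a Linux kernel and libc that expose MADV_DONTFORK and MADV_DOFORK. The helper name set_fork_inherit is hypothetical and is not part of libibverbs; the sketch shows only the page rounding and the madvise() call, not the registration itself.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a libibverbs function): toggle fork
 * inheritance for the pages covering [addr, addr + len).
 *   inherit == 0  -> MADV_DONTFORK (child does not get these pages)
 *   inherit != 0  -> MADV_DOFORK   (restore the default behaviour)
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_fork_inherit(void *addr, size_t len, int inherit)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t mask, start, end;

    if (page <= 0 || len == 0) {
        errno = EINVAL;
        return -1;
    }

    /*
     * madvise() needs a page-aligned start address, so widen the
     * range to whole pages. Note that this also affects unrelated
     * data that happens to share the first or last page.
     */
    mask = (uintptr_t)page - 1;
    start = (uintptr_t)addr & ~mask;
    end = ((uintptr_t)addr + len + mask) & ~mask;

    return madvise((void *)start, end - start,
                   inherit ? MADV_DOFORK : MADV_DONTFORK);
}
```

A library or application would call set_fork_inherit(buf, len, 0) before registering a buffer and set_fork_inherit(buf, len, 1) after deregistering it. Whether libibverbs does this transparently or exposes an entry point for it is the open question in this thread.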
From ftillier at silverstorm.com Mon Feb 20 18:14:40 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 20 Feb 2006 18:14:40 -0800 Subject: [openib-general] IPoIB broadcast MC group membership Message-ID: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> If I understand the code correctly, IPoIB depends on the broadcast MC group existing, as it only ever issues a MC join that does not create the group to the SA. First, is this correct? Second, if so, how is IPoIB supposed to interact with subnet managers that don't pre-create an empty broadcast group? Shouldn't IPoIB first do a GET for the broadcast group, and use those settings if it exist, otherwise create it? Thanks, - Fab From rdreier at cisco.com Mon Feb 20 18:25:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 18:25:53 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> (Fabian Tillier's message of "Mon, 20 Feb 2006 18:14:40 -0800") References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> Message-ID: Fabian> If I understand the code correctly, IPoIB depends on the Fabian> broadcast MC group existing, as it only ever issues a MC Fabian> join that does not create the group to the SA. Fabian> First, is this correct? Yes. Fabian> Second, if so, how is IPoIB supposed to interact with Fabian> subnet managers that don't pre-create an empty broadcast Fabian> group? Fabian> Shouldn't IPoIB first do a GET for the broadcast group, Fabian> and use those settings if it exist, otherwise create it? What parameters should it use to create it? The IETF drafts for IPoIB say that the IPv4 broadcast group must be created administratively before an IPoIB interface can be brought up. What subnet managers exist that don't have some sort of special handling for the IPoIB broadcast group? - R. 
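As background for the exchange above, the sketch below builds the MGID that identifies the IPv4 broadcast group. It assumes the layout given in the IPoIB drafts (later published as RFC 4391): 0xff, a flags/scope byte, the IPv4 signature 0x401b, the P_Key, zero padding, and an all-ones low 32 bits. The function name ipoib_bcast_mgid is hypothetical. Note that Q_Key, SL, flow label and traffic class are not encoded in the MGID, which is why they have to come from somewhere else when a group is created.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fill gid[16] with the IPoIB IPv4 broadcast
 * MGID for the given P_Key and multicast scope (2 = link-local),
 * following the layout assumed from the IPoIB drafts:
 *   ff1<scope>:401b:<P_Key>:0000:0000:0000:ffff:ffff
 */
static void ipoib_bcast_mgid(uint8_t gid[16], uint16_t pkey, uint8_t scope)
{
    memset(gid, 0, 16);

    gid[0] = 0xff;                   /* multicast prefix */
    gid[1] = 0x10 | (scope & 0x0f);  /* flags = 1 (transient), scope */
    gid[2] = 0x40;                   /* IPoIB IPv4 signature, high byte */
    gid[3] = 0x1b;                   /* IPoIB IPv4 signature, low byte */

    pkey |= 0x8000;                  /* membership bit set in the MGID */
    gid[4] = (uint8_t)(pkey >> 8);
    gid[5] = (uint8_t)(pkey & 0xff);

    memset(gid + 12, 0xff, 4);       /* all-ones broadcast group ID */
}
```

With the default P_Key and link-local scope this should give ff12:401b:ffff:0000:0000:0000:ffff:ffff, which is the group a GET or join for the broadcast group would be keyed on.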
From ftillier at silverstorm.com Mon Feb 20 18:53:55 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 20 Feb 2006 18:53:55 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> Message-ID: <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> On 2/20/06, Roland Dreier wrote: > Fabian> Second, if so, how is IPoIB supposed to interact with > Fabian> subnet managers that don't pre-create an empty broadcast > Fabian> group? > > Fabian> Shouldn't IPoIB first do a GET for the broadcast group, > Fabian> and use those settings if it exist, otherwise create it? > > What parameters should it use to create it? The only parameter that can be problematic is the QKey, but it's not a problem for it to just make one up, as long as it's a privileged one. All other parameters can be taken from the local port info. > The IETF drafts for IPoIB say that the IPv4 broadcast group must be > created administratively before an IPoIB interface can be brought up. Doesn't the IB spec require that a multicast group have a member? That is, when the last member leaves the group, the multicast group is destroyed? Further, the IETF drafts for IPoIB only recommend administrative creation of the broadcast group, but allow creation by the first member. An IB MC join of a nonexistent group should fail unless all the proper parameters are provided to create the group. What is the behavior of SMs that pre-create the group in response to a GET query for the MC group parameters? Does the query return a record, or does it fail with no records?
- Fab From caitlinb at broadcom.com Mon Feb 20 19:15:15 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Mon, 20 Feb 2006 19:15:15 -0800 Subject: [openib-general] NFS/RDMA client release for Linux 2.6.15 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1296FF2@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Are we all talking about the same thing here? I think > Christoph is just asking that the code use > dma_map_[single|page|sg] instead of using page_to_phys or > virt_to_phys. > > The rest of this talk is about memory registration strategies > which is a different issue than having the code assume that > bus address == phys addresses. > Yes, that was the original point. And it is an important one. But Roland's comment seemed to question the usage scenario, and making sure everyone understands how applications use RDMA (or perhaps more importantly want to use) is important. Our choices on the verb layer APIs have very real impacts on how applications work. I was disagreeing that windows and/or FMRs are not of vital importance to many applications. Part of building a viable long-term stack is shifting to thinking down from the application needs rather than up from the hardware. A successful stack will eventually support so many hardware options that the idea of working up from the hardware will be self-evidently absurd. We just need to start making that shift. 
From rdreier at cisco.com Mon Feb 20 19:39:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 19:39:40 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> (Fabian Tillier's message of "Mon, 20 Feb 2006 18:53:55 -0800") References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> Message-ID: Fabian> The only parameter that can be problematic is the QKey, but Fabian> it's not a problem for it to just make one up, as long as Fabian> it's a privileged one. All other parameters can be taken Fabian> from the local port info. Actually all of the extra parameters (Q_Key, SL, flow label, traffic class) are policy decisions that don't have a clear default for the driver to choose. Fabian> What is the behavior of SMs that pre-create the group in Fabian> response to a GET query for the MC group parameters? Does Fabian> the query return a record, or does it fail with no Fabian> records? I guess it depends on the SM. Do you know of an SM that has problems with the existing Linux IPoIB driver? - R. From rdreier at cisco.com Mon Feb 20 19:42:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 19:42:41 -0800 Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: <20060202.103448.118065476.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Thu, 02 Feb 2006 10:34:48 +0900 (JST)") References: <20060202.103448.118065476.yoshfuji@linux-ipv6.org> Message-ID: Hi Dave, have you had a chance to look at this? I can resend again if you've lost the original mail. Also, let me know if you want me to merge this through my tree when 2.6.17 opens up. The only feedback I've seen is that Yoshfuji-san has said that this looks sane.
Thanks, Roland From davem at davemloft.net Mon Feb 20 20:01:52 2006 From: davem at davemloft.net (David S. Miller) Date: Mon, 20 Feb 2006 20:01:52 -0800 (PST) Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: References: <20060202.103448.118065476.yoshfuji@linux-ipv6.org> Message-ID: <20060220.200152.00014728.davem@davemloft.net> From: Roland Dreier Date: Mon, 20 Feb 2006 19:42:41 -0800 > Hi Dave, have you had a chance to look at this? Not yet, it's very low on the priority list at the moment, but I do still have it in my inbox so don't worry. From rdreier at cisco.com Mon Feb 20 20:48:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 20:48:30 -0800 Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: <20060220.200152.00014728.davem@davemloft.net> (David S. Miller's message of "Mon, 20 Feb 2006 20:01:52 -0800 (PST)") References: <20060202.103448.118065476.yoshfuji@linux-ipv6.org> <20060220.200152.00014728.davem@davemloft.net> Message-ID: David> Not yet, it's very low on the priority list at the moment, David> but I do still have it in my inbox so don't worry. Do you think you'll get a chance to look at it for 2.6.17? If not we can work around things in the IPoIB driver in a slightly uglier way for 2.6.17. Thanks, Roland From davem at davemloft.net Mon Feb 20 20:55:34 2006 From: davem at davemloft.net (David S. Miller) Date: Mon, 20 Feb 2006 20:55:34 -0800 (PST) Subject: [openib-general] Re: [PATCH RESEND] net: Move destructor from neigh->ops to neigh_params In-Reply-To: References: <20060220.200152.00014728.davem@davemloft.net> Message-ID: <20060220.205534.115202407.davem@davemloft.net> From: Roland Dreier Date: Mon, 20 Feb 2006 20:48:30 -0800 > Do you think you'll get a chance to look at it for 2.6.17? Yes I will. 
From rdreier at cisco.com Mon Feb 20 21:54:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Feb 2006 21:54:14 -0800 Subject: [openib-general] Re: ipoib patches In-Reply-To: <20060220195845.GG22781@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Feb 2006 21:58:45 +0200") References: <20060220195845.GG22781@mellanox.co.il> Message-ID: I looked at some of the simpler patches... ipoib_cosmetics.patch: Applied to svn and for-2.6.17 git branch. ipoib_init_qp.patch: The rationale seems debatable. Is it really worth making this change? ipoib_qprst_protect.patch: It seems that the worst case of stale values is another 1 millisecond sleep and another iteration of the loop. Unless I'm missing something, it doesn't seem worth caring about this. - R. From ftillier at silverstorm.com Mon Feb 20 22:10:30 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Mon, 20 Feb 2006 22:10:30 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> Message-ID: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> On 2/20/06, Roland Dreier wrote: > Fabian> What is the behavior of SMs that pre-create the group in > Fabian> response to a GET query for the MC group parameters? Does > Fabian> the query return a record, or does it fail with no > Fabian> records? > > I guess it depends on the SM. I guess the follow up question is how an SM handles a MC join that specifies the QKey and other settings that may conflict with the preset ones, but didn't return a response to the GET query. Does it fail the join, does it succeed it with the provided values, or does it succeed it and return the preset parameters? 
OpenSM seems to respond to the GET query, even if there are no members in the group - and the query returns a group that specifies a rate of 10Gbps (4X SDR - same as the system running OpenSM, incidentally). > Do you know of an SM that has problems with the existing Linux IPoIB driver? No, actually my query wasn't driven by an actual issue with the Linux IPoIB driver. I was trying to figure out how to do better error handling and diagnostic logging in the Windows IPoIB and wanted to see how the Linux IPoIB driver handled a similar situation. The problem I was having was that the sequence of events I was using in the Windows IPoIB resulted in ambiguous error conditions, where it wasn't possible to differentiate between unexpected errors and errors that could be worked around. The Windows IPoIB follows a sequence like: if( GET broadcast group == NO_ERROR ) if( SET join broadcast group != NO_ERROR ) repeat GET; else if( SET create broadcast group != NO_ERROR ) repeat GET; Specifically, the problem relates to handling a 1X node trying to join the broadcast group, and what the retry policy should be. If the group already exists at 4X, the join should fail if the SM follows the compliance statements in the IB spec. Because the code allowed for the broadcast group not pre-existing (that is, a join could fail because the group wasn't created), it was unclear whether a failure of the join indicated that there was a setting incompatibility (1X vs. 4X), or just whether the group needed to be created. Then, because the code handled the race where some other node beat it to creation and thus resulted in invalid settings, a failure in creation resulted in a retry of the whole process, starting with a new GET query. A 1X node in such a case ends up perpetually retrying the sequence of events, even though it really should just stop and wait for the next port up event (since link width changes require the port to go through the down state as far as I understand).
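The sequence described above can be sketched as a bounded loop. This is only an illustration under stated assumptions: it is not the Windows or Linux IPoIB code, every name in it (sa_ops, bcast_attach, fake_sa and so on) is hypothetical, the SA is modelled as three callbacks that report only success or failure, and the attempt limit is an added design choice so that a node which can never join (the 1X case) stops and waits for the next port-up event instead of retrying forever.

```c
#include <stddef.h>

/* Hypothetical status: the SA tells the caller little beyond pass/fail. */
enum sa_status { SA_OK, SA_ERR };

enum bcast_result { BCAST_JOINED, BCAST_CREATED, BCAST_GAVE_UP };

/* Hypothetical SA operations on the broadcast group. */
struct sa_ops {
    enum sa_status (*get)(void *ctx);     /* GET the group record    */
    enum sa_status (*join)(void *ctx);    /* SET: join, do not create */
    enum sa_status (*create)(void *ctx);  /* SET: create and join     */
};

/*
 * GET the group; if it exists try to join, otherwise try to create.
 * Any failure restarts from GET, at most max_attempts times.
 */
static enum bcast_result bcast_attach(const struct sa_ops *ops, void *ctx,
                                      int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (ops->get(ctx) == SA_OK) {
            if (ops->join(ctx) == SA_OK)
                return BCAST_JOINED;
        } else {
            if (ops->create(ctx) == SA_OK)
                return BCAST_CREATED;
        }
        /* Join or create failed: repeat the GET. */
    }
    /* Stop here and wait for the next port-up event. */
    return BCAST_GAVE_UP;
}

/* A tiny fake SA used to exercise the loop. */
struct fake_sa {
    int group_exists;  /* GET succeeds when non-zero                 */
    int join_ok;       /* join succeeds when non-zero                */
    int create_ok;     /* create succeeds when non-zero              */
    int gets;          /* number of GET queries issued so far        */
};

static enum sa_status fake_get(void *ctx)
{
    struct fake_sa *sa = ctx;
    sa->gets++;
    return sa->group_exists ? SA_OK : SA_ERR;
}

static enum sa_status fake_join(void *ctx)
{
    struct fake_sa *sa = ctx;
    return sa->join_ok ? SA_OK : SA_ERR;
}

static enum sa_status fake_create(void *ctx)
{
    struct fake_sa *sa = ctx;
    if (!sa->create_ok)
        return SA_ERR;
    sa->group_exists = 1;
    return SA_OK;
}

static const struct sa_ops fake_ops = { fake_get, fake_join, fake_create };
```

With a fake SA where the group exists but the join always fails, bcast_attach(&fake_ops, &sa, 3) issues three GETs and then gives up, which is where a driver could log a single clear message about a probable rate or width mismatch.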
The lack of detailed error reporting in SA queries could stand to be improved, and something as simple as the SA returning a component mask indicating which components caused conflicts would be extremely useful in determining the next course of action. ERR_REQ_INVALID is just too broad in this case to allow the code to do anything intelligent. As a note, OpenSM seems to allow a 1X node to join a 4X multicast group which it should not, unless the join specifies the rate in which case the join fails as expected. Do we just not care that a 1X node could be dropping 3/4 of the packets sent on the broadcast group, aside from OpenSM violating o15-0.1.13? Note that the failure if the rate is specified occurs even if the 1X node is the first to attempt to join (that is, no other nodes on the fabric have IPoIB running). Anyhow, I'm still not sure how to cleanly handle these errors so that the system log is pretty clear that things are not working likely due to a bad cable. - Fab From dotanb at mellanox.co.il Mon Feb 20 22:19:51 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 21 Feb 2006 08:19:51 +0200 Subject: [openib-general] RE: ibv_create_srq doesn't update the SRQ init attributes Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C50DB@mtlexch01.mtl.com> > I don't think either the ABI or API has to change. Right now we have >struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, > struct ibv_srq_init_attr *srq_init_attr); >so the library should just update srq_init_attr with the values >actually used to create the SRQ. The parameter isn't declared const >or anything, so this should be fine for the 1.0 release. In the kernel level, You are right: I sent a patch to enable the mthca low level driver to update those attributes. In the user level the response structure doesn't allow the return of the SRQ attributes from kernel level to user level, so I think that the ABI should be changed to support it. What do you think? 
Dotan From yael at mellanox.co.il Mon Feb 20 23:10:15 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 21 Feb 2006 09:10:15 +0200 Subject: [openib-general] RE: Re:[PATCH] OpenSM/st.c: Fix some size_t issues related tomemoryallocation in st.c Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDB7@mtlexch01.mtl.com> Hi Hal, That is because I accidentally sent the original patch.... The fix is in line 404 - make i of size_t (instead of int) as well. I will send that patch separately.
Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, February 20, 2006 5:28 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: Re:[PATCH] OpenSM/st.c: Fix some size_t issues related tomemoryallocation in st.c Hi Yael, On Sun, 2006-02-19 at 07:24, Yael Kalka wrote: > Hi Hal, > > The patch in general is fine. I've added one change to the original > patch - to avoid casting issues underwindows > Below is the full patch. This looks like the same patch to me. What changed ? (What did I miss ?) -- Hal > > Yael > > Signed-off-by: Hal Rosenstock > > Index: include/opensm/st.h =================================================================== > --- include/opensm/st.h (revision 5436) > +++ include/opensm/st.h (working copy) > @@ -40,6 +40,8 @@ > #ifndef ST_INCLUDED > #define ST_INCLUDED > > +#include > + > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > # define END_C_DECLS } > @@ -79,11 +81,11 @@ struct st_table { > enum st_retval {ST_CONTINUE, ST_STOP, ST_DELETE}; > > st_table *st_init_table(struct st_hash_type *); > -st_table *st_init_table_with_size(struct st_hash_type *, int); > +st_table *st_init_table_with_size(struct st_hash_type *, size_t); > st_table *st_init_numtable(void); > -st_table *st_init_numtable_with_size(int); > +st_table *st_init_numtable_with_size(size_t); > st_table *st_init_strtable(void); > -st_table *st_init_strtable_with_size(int); > +st_table *st_init_strtable_with_size(size_t); > int st_delete(st_table *, st_data_t *, st_data_t *); > int st_delete_safe(st_table *, st_data_t *, st_data_t *, st_data_t); int st_insert(st_table *, st_data_t, st_data_t); > > Index: opensm/st.c > =================================================================== > --- opensm/st.c (revision 5436) > +++ opensm/st.c (working copy) > @@ -42,7 +42,6 @@ > #endif /* HAVE_CONFIG_H */ > > #include > -#include > #include > #include > > @@ -102,17 +101,11 @@ static struct st_hash_type 
type_strhash > #define xcalloc calloc > #define xrealloc realloc > #define xfree free > -#if 0 > -void *xmalloc(long); > -void *xcalloc(long, long); > -void *xrealloc(void *, long); > -void xfree(void *); > -#endif > > static void rehash(st_table *); > > -#define alloc(type) (type*)xmalloc((unsigned)sizeof(type)) > -#define Calloc(n,s) (char*)xcalloc((n),(s)) > +#define alloc(type) (type*)xmalloc(sizeof(type)) > +#define Calloc(n,s) (char*)xcalloc((n), (s)) > > #define EQUAL(table,x,y) ((x)==(y) || (*table->type->compare)(((void*)x),((void *)y)) == 0) > > @@ -200,7 +193,7 @@ stat_col() > st_table* > st_init_table_with_size(type, size) > struct st_hash_type *type; > - int size; > + size_t size; > { > st_table *tbl; > > @@ -238,7 +231,7 @@ st_init_numtable(void) > > st_table* > st_init_numtable_with_size(size) > - int size; > + size_t size; > { > return st_init_table_with_size(&type_numhash, size); > } > @@ -251,7 +244,7 @@ st_init_strtable(void) > > st_table* > st_init_strtable_with_size(size) > - int size; > + size_t size; > { > return st_init_table_with_size(&type_strhash, size); > } > @@ -314,7 +307,8 @@ st_lookup(table, key, value) > return 0; > } > else { > - if (value != 0) *value = ptr->record; > + if (value != 0) > + *value = ptr->record; > return 1; > } > } > @@ -407,7 +401,8 @@ st_copy(old_table) > { > st_table *new_table; > st_table_entry *ptr, *entry; > - int i, num_bins = old_table->num_bins; > + int i; > + size_t num_bins = old_table->num_bins; > > new_table = alloc(st_table); > if (new_table == 0) > @@ -417,7 +412,7 @@ st_copy(old_table) > > *new_table = *old_table; > new_table->bins = (st_table_entry**) > - Calloc((unsigned)num_bins, sizeof(st_table_entry*)); > + Calloc(num_bins, sizeof(st_table_entry*)); > > if (new_table->bins == 0) > { > @@ -524,7 +519,7 @@ st_delete_safe(table, key, value, never) > } > > static int > -delete_never( st_data_t key, st_data_t value, st_data_t never) > +delete_never(st_data_t key, st_data_t value, st_data_t never) > 
{ > if (value == never) return ST_DELETE; > return ST_CONTINUE; > > > From yael at mellanox.co.il Mon Feb 20 23:10:31 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 21 Feb 2006 09:10:31 +0200 Subject: [openib-general] Re:[PATCH] OpenSM/st.c: Fix some size_t issues related tomemoryallocation in st.c Message-ID: <5zoe11jifs.fsf@mtl066.yok.mtl.com> Hi Hal, Here is the patch with the fix for the type of "i". patch - to avoid casting issues underwindows Below is the full patch. Yael Signed-off-by: Hal Rosenstock Index: include/opensm/st.h =================================================================== --- include/opensm/st.h (revision 5446) +++ include/opensm/st.h (working copy) @@ -40,6 +40,8 @@ #ifndef ST_INCLUDED #define ST_INCLUDED +#include + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -79,11 +81,11 @@ struct st_table { enum st_retval {ST_CONTINUE, ST_STOP, ST_DELETE}; st_table *st_init_table(struct st_hash_type *); -st_table *st_init_table_with_size(struct st_hash_type *, int); +st_table *st_init_table_with_size(struct st_hash_type *, size_t); st_table *st_init_numtable(void); -st_table *st_init_numtable_with_size(int); +st_table *st_init_numtable_with_size(size_t); st_table *st_init_strtable(void); -st_table *st_init_strtable_with_size(int); +st_table *st_init_strtable_with_size(size_t); int st_delete(st_table *, st_data_t *, st_data_t *); int st_delete_safe(st_table *, st_data_t *, st_data_t *, st_data_t); int st_insert(st_table *, st_data_t, st_data_t); Index: opensm/st.c =================================================================== --- opensm/st.c (revision 5446) +++ opensm/st.c (working copy) @@ -42,7 +42,6 @@ #endif /* HAVE_CONFIG_H */ #include -#include #include #include @@ -102,16 +101,10 @@ static struct st_hash_type type_strhash #define xcalloc calloc #define xrealloc realloc #define xfree free -#if 0 -void *xmalloc(long); -void *xcalloc(long, long); -void *xrealloc(void *, long); -void xfree(void *); 
-#endif static void rehash(st_table *); -#define alloc(type) (type*)xmalloc((unsigned)sizeof(type)) +#define alloc(type) (type*)xmalloc(sizeof(type)) #define Calloc(n,s) (char*)xcalloc((n),(s)) #define EQUAL(table,x,y) ((x)==(y) || (*table->type->compare)(((void*)x),((void *)y)) == 0) @@ -200,7 +193,7 @@ stat_col() st_table* st_init_table_with_size(type, size) struct st_hash_type *type; - int size; + size_t size; { st_table *tbl; @@ -238,7 +231,7 @@ st_init_numtable(void) st_table* st_init_numtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_numhash, size); } @@ -251,7 +244,7 @@ st_init_strtable(void) st_table* st_init_strtable_with_size(size) - int size; + size_t size; { return st_init_table_with_size(&type_strhash, size); } @@ -314,7 +307,8 @@ st_lookup(table, key, value) return 0; } else { - if (value != 0) *value = ptr->record; + if (value != 0) + *value = ptr->record; return 1; } } @@ -407,7 +401,7 @@ st_copy(old_table) { st_table *new_table; st_table_entry *ptr, *entry; - int i, num_bins = old_table->num_bins; + size_t i, num_bins = old_table->num_bins; new_table = alloc(st_table); if (new_table == 0) @@ -417,7 +411,7 @@ st_copy(old_table) *new_table = *old_table; new_table->bins = (st_table_entry**) - Calloc((unsigned)num_bins, sizeof(st_table_entry*)); + Calloc(num_bins, sizeof(st_table_entry*)); if (new_table->bins == 0) { From glebn at voltaire.com Tue Feb 21 04:33:55 2006 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 21 Feb 2006 14:33:55 +0200 Subject: [openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK In-Reply-To: References: <20060213154114.GO32041@mellanox.co.il> <20060213190206.GC12458@mellanox.co.il> <20060214065145.GE24524@minantech.com> <20060214164236.GB12974@mellanox.co.il> <20060215073452.GI24524@minantech.com> <20060215082331.GE10026@mellanox.co.il> <20060220090323.GA24524@minantech.com> Message-ID: <20060221123355.GA6341@minantech.com> On Mon, Feb 20, 2006 at 08:50:30AM -0800, Roland 
Dreier wrote: > Probably the best thing to do would be to have libibverbs handle the > madvise() stuff transparently by default, but perhaps add some hooks > for more intelligent applications or MPI libraries to do their own thing. > I also think that transparent handling is the best. I am not sure about hook though. I think a call that will enable/disable madvise() handling from inside reg_mr might be added to libibverbs so the application can handle madvise() by itself if it wants to. I can't think why MPI will want to do it by itself. -- Gleb. From yael at mellanox.co.il Tue Feb 21 04:43:48 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 21 Feb 2006 14:43:48 +0200 Subject: [openib-general] [RFC] [PATCH] OpenSM: Add functional partitionmanager support Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDBE@mtlexch01.mtl.com> Hello Sasha, There are problems with the patch you sent - it doesn't work when trying to apply it. One problem, for example - is that file opensm/osm_partition.h is given with diffs, when actually it is a new file. Yael -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Sasha Khapyorsky Sent: Sunday, February 19, 2006 6:39 PM To: halr at voltaire.com; openib-general at openib.org Subject: [openib-general] [RFC] [PATCH] OpenSM: Add functional partitionmanager support Hello, Here is phase 1 of the partition manager for OpenSM. Please review. Thanks, Sasha. This patch implements partition management for OpenSM (Phase 1) as described in osm/doc/OpenSM_PKey_Mgr.txt.
Basically at each heavy resweep this will: - recreate partition configuration - update pkey tables for endports - update switch's ports connected to endports - for partitions marked for IPoIB support this will also create appropriate multicast group Signed-off-by: Sasha Khapyorsky diff --git a/osm/doc/partition-config.txt b/osm/doc/partition-config.txt new file mode 100644 index 0000000..b3ba804 --- /dev/null +++ b/osm/doc/partition-config.txt @@ -0,0 +1,90 @@ +OpenSM Partitions configuration. +=============================== + +The default name of OpenSM partitions configuration file is +'/etc/osm-partitions.txt'. The default may be changed by using +--Pconfig (-P) option with OpenSM. + +The default partition will be created by OpenSM unconditionally even +when partition configuration file does not exist or cannot be accessed. + +The default partition has P_Key value 0x7fff. OpenSM's port will have +full membership in default partition. all other end ports will have +partial membership. + + +File Format. +=========== + +Comments: +-------- + +Line content followed after '#' character is comment and ignored by +parser. + + +General file format: +------------------- + +: ; + + +Partition Definition: +-------------------- + +[PartitionName][=PKey][,flag] + +PartitionName - free string, will be used with logging. When omitted + empty string will be used. +PKey - P_Key value for this partition. Only low 15 bits will + be used. When omitted will be autogenerated. +flag - used to indicate IPoIB capability of this partition. + 'ipoib' is only valid value currently (in future other + values may be added). + + +PortGUIDs list: +-------------- + +[PortGUID[=full|=part]] [,PortGUID[=full|=part]] [,PortGUID] ... + +PortGUID - GUID of partition member EndPort. Hexadecimal numbers + should start from 0x. +full or part - indicates full or partial membership for this port. When + omitted (or unrecognized) partial membership is assumed. 
+ +There are two useful keywords for PortGUID definition: + +- 'ALL' means all end ports in this subnet +- 'SELF' means subnet manager's port. + +Empty list means no ports in this partition. + + +Notes: +----- + +White spaces are permitted between delimiters ('=', ',',':',';'). + +The Line can be wrapped after ':' followed after Partition Definition and +between. + +PartitionName does not need to be unique, PKey does need to be unique. +If PKey is repeated then those partition configurations will be merged +(see also next note). + +It is possible to split partition configuration in more than one +definition, but then PKey should be explicitly specified (overwise +different PKey values will be generated for those definitions). + + +Examples: +-------- + +Default=0x7fff : ALL, SELF=full ; + +NewPartition , ipoib : 0x123456=full, 0x3456789034=part, 0x2134af2306 ; + +YetAnotherOne = 0x300 : SELF=full ; +YetAnotherOne = 0x300 : ALL=part ; + diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h index 660771f..3da39a6 100644 --- a/osm/include/opensm/osm_base.h +++ b/osm/include/opensm/osm_base.h @@ -222,6 +222,22 @@ BEGIN_C_DECLS #endif /***********/ +/****d* OpenSM: Base/OSM_DEFAULT_PARTITION_CONFIG_FILE +* NAME +* OSM_DEFAULT_PARTITION_CONFIG_FILE +* +* DESCRIPTION +* Specifies the default partition config file name +* +* SYNOPSIS +*/ +#ifdef __WIN__ +#define OSM_DEFAULT_PARTITION_CONFIG_FILE strcat(GetOsmPath(), "osm-partitions.conf") +#else +#define OSM_DEFAULT_PARTITION_CONFIG_FILE "/etc/osm-partitions.conf" +#endif +/***********/ + /****d* OpenSM: Base/OSM_DEFAULT_SWEEP_INTERVAL_SECS * NAME * OSM_DEFAULT_SWEEP_INTERVAL_SECS diff --git a/osm/include/opensm/osm_partition.h b/osm/include/opensm/osm_partition.h index 27678c2..369cf8a 100644 --- a/osm/include/opensm/osm_partition.h +++ b/osm/include/opensm/osm_partition.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. 
All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -50,6 +50,12 @@ #ifndef _OSM_PARTITION_H_ #define _OSM_PARTITION_H_ +#include +#include +#include +#include +#include + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -94,12 +100,17 @@ BEGIN_C_DECLS */ typedef struct _osm_prtn { - uint16_t pkey; - cl_map port_guid_tbl; - + cl_map_item_t map_item; + uint16_t pkey; + cl_map_t full_guid_tbl; + cl_map_t part_guid_tbl; + char name[32]; } osm_prtn_t; /* * FIELDS +* map_item +* Linkage structure for cl_qmap. MUST BE FIRST MEMBER! +* * pkey * The IBA defined P_KEY of this Partition. * @@ -111,118 +122,61 @@ typedef struct _osm_prtn * Partition *********/ -/****f* OpenSM: Partition/osm_prtn_construct +/****f* OpenSM: Partition/osm_prtn_delete * NAME -* osm_prtn_construct +* osm_prtn_delete * * DESCRIPTION -* This function constructs a Partition. +* This function destroys and deallocates a Partition object. * * SYNOPSIS */ -void osm_prtn_construct( - IN osm_prtn_t* const p_prtn ); +void osm_prtn_delete( + IN OUT osm_prtn_t** const pp_prtn ); /* * PARAMETERS -* p_prtn -* [in] Pointer to a Partition to construct. +* pp_prtn +* [in][out] Pointer to a pointer to a Partition object to +* delete. On return, this pointer is NULL. * * RETURN VALUE * This function does not return a value. * * NOTES -* Allows calling osm_prtn_init, osm_prtn_destroy, and osm_prtn_is_inited. -* -* Calling osm_prtn_construct is a prerequisite to calling any other -* method except osm_prtn_init. +* Performs any necessary cleanup of the specified Partition object.
* * SEE ALSO -* Partition, osm_prtn_init, osm_prtn_destroy, osm_prtn_is_inited +* Partition, osm_prtn_new *********/ -/****f* OpenSM: Partition/osm_prtn_destroy +/****f* OpenSM: Partition/osm_prtn_new * NAME -* osm_prtn_destroy +* osm_prtn_new * * DESCRIPTION -* The osm_prtn_destroy function destroys a Partition, releasing -* all resources. +* This function allocates and initializes a Partition object. * * SYNOPSIS */ -void osm_prtn_destroy( - IN osm_prtn_t* const p_prtn ); +osm_prtn_t* osm_prtn_new( + IN const char *name, + IN const uint16_t pkey ); /* * PARAMETERS -* p_prtn -* [in] Pointer to a Partition to destroy. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* Performs any necessary cleanup of the specified Partition. -* Further operations should not be attempted on the destroyed object. -* This function should only be called after a call to osm_prtn_construct or -* osm_prtn_init. -* -* SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_init -*********/ - -/****f* OpenSM: Partition/osm_prtn_init -* NAME -* osm_prtn_init -* -* DESCRIPTION -* The osm_prtn_init function initializes a Partition for use. -* -* SYNOPSIS -*/ -ib_api_status_t osm_prtn_init( - IN osm_prtn_t* const p_prtn ); -/* -* PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object to initialize. -* -* RETURN VALUES -* CL_SUCCESS if initialization was successful. -* -* NOTES -* Allows calling other Partition methods. +* name +* [in] Partition name string * -* SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_destroy, -* osm_prtn_is_inited -*********/ - -/****f* OpenSM: Partition/osm_prtn_is_inited -* NAME -* osm_prtn_is_inited -* -* DESCRIPTION -* Indicates if the object has been initialized with osm_prtn_init. -* -* SYNOPSIS -*/ -boolean_t osm_ptrn_is_inited( - IN const osm_prtn_t* const p_prtn ); -/* -* PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object. 
+* pkey +* [in] Partition P_Key value * -* RETURN VALUES -* TRUE if the object was initialized successfully, -* FALSE otherwise. +* RETURN VALUE +* Pointer to the initialized Partition object. * * NOTES -* The osm_prtn_construct or osm_prtn_init must be called before using -* this function. +* Allows calling other partition methods. * * SEE ALSO -* Partition, osm_prtn_construct, osm_prtn_init +* Partition *********/ /****f* OpenSM: Partition/osm_prtn_is_guid @@ -234,9 +188,14 @@ boolean_t osm_ptrn_is_inited( * * SYNOPSIS */ +static inline boolean_t osm_prtn_is_guid( IN const osm_prtn_t* const p_prtn, - IN const uint64 guid ); + IN const ib_net64_t guid ) +{ + return (cl_map_get(&p_prtn->full_guid_tbl, guid) != NULL) || + (cl_map_get(&p_prtn->part_guid_tbl, guid) != NULL); +} /* * PARAMETERS * p_prtn @@ -254,24 +213,28 @@ boolean_t osm_prtn_is_guid( * SEE ALSO *********/ -/****f* OpenSM: Partition/osm_prtn_get_pkey +/****f* OpenSM: Partition/osm_prtn_make_partitions * NAME -* osm_prtn_get_pkey +* osm_prtn_make_partitions * * DESCRIPTION -* Gets the IBA defined P_KEY value for this Partition. +* Makes all partitions in subnet. * * SYNOPSIS */ -uint16_t osm_prtn_get_pkey( - IN const osm_prtn_t* const p_prtn ); +ib_api_status_t osm_prtn_make_partitions( + IN osm_log_t * const p_log, + IN osm_subn_t * const p_subn); /* * PARAMETERS -* p_prtn -* [in] Pointer to an osm_prtn_t object. +* p_log +* [in] Pointer to a log object. +* +* p_subn +* [in] Pointer to subnet object. * * RETURN VALUES -* P_KEY value for this Partition. +* IB_SUCCESS value on success. * * NOTES * diff --git a/osm/include/opensm/osm_sa_mcmember_record.h b/osm/include/opensm/osm_sa_mcmember_record.h index 97c6296..6e4e033 100644 --- a/osm/include/opensm/osm_sa_mcmember_record.h +++ b/osm/include/opensm/osm_sa_mcmember_record.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
* Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -315,6 +315,42 @@ osm_mcmr_rcv_create_new_mgrp( * *********/ +/****f* OpenSM: MC Member record Receiver/osm_mcmr_rcv_find_or_create_new_mgrp +* NAME +* osm_mcmr_rcv_find_or_create_new_mgrp +* +* DESCRIPTION +* Create new Multicast group +* +* SYNOPSIS +*/ + +ib_api_status_t +osm_mcmr_rcv_find_or_create_new_mgrp( + IN osm_mcmr_recv_t* const p_mcmr, + IN uint64_t comp_mask, + IN ib_member_rec_t* const p_recvd_mcmember_rec, + OUT osm_mgrp_t **pp_mgrp); +/* +* PARAMETERS +* p_mcmr +* [in] Pointer to an osm_mcmr_recv_t object. +* p_recvd_mcmember_rec +* [in] Received Multicast member record +* +* pp_mgrp +* [out] pointer the osm_mgrp_t object +* +* RETURN VALUES +* IB_SUCCESS, IB_ERROR +* +* NOTES +* +* +* SEE ALSO +* +*********/ + #define JOIN_MC_COMP_MASK (IB_MCR_COMPMASK_MGID | \ IB_MCR_COMPMASK_PORT_GID | \ IB_MCR_COMPMASK_JOIN_STATE) diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index 0aab874..7841c29 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -73,6 +73,8 @@ BEGIN_C_DECLS #define OSM_SUBNET_VECTOR_CAPACITY 256 +struct _osm_opensm_t; + /****h* OpenSM/Subnet * NAME * Subnet @@ -220,6 +222,7 @@ typedef struct _osm_subn_opt uint8_t log_flags; char * dump_files_dir; char * log_file; + const char * partition_config_file; boolean_t accum_log_file; boolean_t console; cl_map_t port_prof_ignore_guids; @@ -399,6 +402,7 @@ typedef struct _osm_subn_opt */ typedef struct _osm_subn { + struct _osm_opensm_t *p_osm; cl_qmap_t sw_guid_tbl; cl_qmap_t node_guid_tbl; cl_qmap_t port_guid_tbl; @@ -644,6 +648,7 @@ osm_subn_destroy( ib_api_status_t osm_subn_init( IN osm_subn_t* const p_subn, + IN struct _osm_opensm_t * const p_osm, IN const osm_subn_opt_t* const p_opt ); /* * PARAMETERS diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 447176a..ea7b007 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -80,6 +80,7 @@ opensm_SOURCES = main.c osm_console.c os osm_state_mgr_ctrl.c osm_subnet.c \ osm_sweep_fail_ctrl.c osm_sw_info_rcv.c \ osm_sw_info_rcv_ctrl.c osm_switch.c \ + osm_prtn.c osm_prtn_config.c \ osm_trap_rcv.c osm_trap_rcv_ctrl.c \ osm_ucast_mgr.c osm_ucast_updn.c \ osm_vl15intf.c osm_vl_arb_rcv.c\ diff --git a/osm/opensm/main.c b/osm/opensm/main.c index c5ba443..ed6ed79 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -182,6 +182,10 @@ show_usage(void) " This option will cause deletion of the log file\n" " (if it previously exists). 
By default, the log file\n" " is accumulative.\n\n"); + printf( "-P\n" + "--Pconfig\n" + " This option defines the optional partition configuration file.\n" + " The default name is \'" OSM_DEFAULT_PARTITION_CONFIG_FILE "\'.\n\n"); printf( "-y\n" "--stay_on_fatal\n" " This option will cause SM not to exit on fatal initialization\n" @@ -470,7 +474,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy"; + const char * const short_option = "i:f:ed:g:l:s:t:a:P:uvVhorcy"; /* In the array below, the 2nd parameter specified the number @@ -491,6 +495,7 @@ main( { "D", 1, NULL, 'D'}, { "log_file", 1, NULL, 'f'}, { "erase_log_file",0, NULL, 'e'}, + { "Pconfig", 1, NULL, 'P'}, { "maxsmps", 1, NULL, 'n'}, { "console", 0, NULL, 'q'}, { "V", 0, NULL, 'V'}, @@ -680,6 +685,10 @@ main( printf(" Creating new log file\n"); break; + case 'P': + opt.partition_config_file = optarg; + break; + case 'y': opt.exit_on_fatal = FALSE; printf(" Staying on fatal initialization errors\n"); diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 6ca6796..94a16ac 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -262,7 +262,7 @@ osm_opensm_init( if( status != IB_SUCCESS ) goto Exit; - status = osm_subn_init( &p_osm->subn, p_opt ); + status = osm_subn_init( &p_osm->subn, p_osm, p_opt ); if( status != IB_SUCCESS ) goto Exit; diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 14ed2db..cc02bac 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -56,6 +56,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -107,121 +108,254 @@ osm_pkey_mgr_init( /********************************************************************** **********************************************************************/
-boolean_t +static ib_api_status_t +osm_pkey_mgr_update_pkey_entry( + IN const osm_pkey_mgr_t * const p_mgr, + IN const osm_physp_t *p_physp, + IN const ib_pkey_table_t *block, + IN const uint16_t block_index) +{ + osm_madw_context_t context; + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); + uint32_t attr_mod; + + context.pkey_context.node_guid = osm_node_get_node_guid(p_node); + context.pkey_context.port_guid = osm_physp_get_port_guid(p_physp); + context.pkey_context.set_method = TRUE; + attr_mod = block_index; + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) + attr_mod |= osm_physp_get_port_num(p_physp) << 16; + return osm_req_set(p_mgr->p_req, osm_physp_get_dr_path_ptr(p_physp), + ( uint8_t * ) block, sizeof( *block ), + IB_MAD_ATTR_P_KEY_TABLE, + cl_hton32( attr_mod ), + CL_DISP_MSGID_NONE, &context ); +} + +/********************************************************************** + **********************************************************************/ + +/* + * Send a new entry for the pkey table for this port when this pkey + * does not exist. Update existed entry when membership was changed. 
+ */ + +static boolean_t __osm_pkey_mgr_process_physical_port( IN const osm_pkey_mgr_t * const p_mgr, - IN osm_node_t * p_node, - IN uint8_t port_num, + IN const ib_net16_t pkey, IN osm_physp_t * p_physp ) { - boolean_t return_val = FALSE; /* TRUE if IB_DEFAULT_PKEY was inserted */ - osm_madw_context_t context; + boolean_t return_val = FALSE; /* TRUE if pkey was inserted or updated */ + ib_api_status_t status; + osm_node_t *p_node = osm_physp_get_node_ptr(p_physp); ib_pkey_table_t *block = NULL; uint16_t block_index; uint16_t num_of_blocks; const osm_pkey_tbl_t *p_pkey_tbl; - uint32_t attr_mod; + ib_net16_t *p_orig_pkey; uint32_t i; - ib_net16_t pkey; - ib_api_status_t status; - boolean_t block_with_empty_entry_found; + boolean_t block_found = FALSE; OSM_LOG_ENTER( p_mgr->p_log, __osm_pkey_mgr_process_physical_port ); - /* - * Send a new entry for the pkey table for this node that includes - * IB_DEFAULT_PKEY when IB_DEFAULT_PARTIAL_PKEY or IB_DEFAULT_PKEY - * don't exist - */ - if ( ( osm_physp_has_pkey( p_mgr->p_log, - IB_DEFAULT_PKEY, - p_physp ) == FALSE ) && - ( osm_physp_has_pkey( p_mgr->p_log, - IB_DEFAULT_PARTIAL_PKEY, p_physp ) == FALSE ) ) - { - context.pkey_context.node_guid = osm_node_get_node_guid( p_node ); - context.pkey_context.port_guid = osm_physp_get_port_guid( p_physp ); - context.pkey_context.set_method = TRUE; - - p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); - num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); - block_with_empty_entry_found = FALSE; + p_pkey_tbl = osm_physp_get_pkey_tbl(p_physp); + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_pkey_tbl); + p_orig_pkey = cl_map_get(&p_pkey_tbl->keys, ib_pkey_get_base(pkey)); + + if ( p_orig_pkey && *p_orig_pkey == pkey ) { + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "No need to insert %04x for node 0x%016" PRIx64 + " port %u\n", + cl_ntoh16(pkey), + 
cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + } + goto _done; + } + else if (!p_orig_pkey) + { for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) { - pkey = block->pkey_entry[i]; - if ( ib_pkey_is_invalid( pkey ) ) + if ( ib_pkey_is_invalid( block->pkey_entry[i] ) ) { - block->pkey_entry[i] = IB_DEFAULT_PKEY; - block_with_empty_entry_found = TRUE; + block->pkey_entry[i] = pkey; + block_found = TRUE; break; } } - if ( block_with_empty_entry_found ) + if ( block_found ) { break; } } - - if ( block_with_empty_entry_found == FALSE ) + } + else + { + *p_orig_pkey = pkey; + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "__osm_pkey_mgr_process_physical_port: ERR 0501: " - "No empty entry was found to insert IB_DEFAULT_PKEY for node " - "0x%016" PRIx64 " and port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + i = p_orig_pkey - block->pkey_entry; + if ( i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK ) { + block_found = TRUE; + break; + } } - else - { - /* Building the attribute modifier */ - if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) - { - /* Port num | Block Index */ - attr_mod = port_num << 16 | block_index; - } - else - { - attr_mod = block_index; - } + } - status = osm_req_set( p_mgr->p_req, - osm_physp_get_dr_path_ptr( p_physp ), - ( uint8_t * ) block, - sizeof( *block ), - IB_MAD_ATTR_P_KEY_TABLE, - cl_hton32( attr_mod ), - CL_DISP_MSGID_NONE, &context ); - return_val = TRUE; /*IB_DEFAULT_PKEY was inserted */ + if ( block_found == FALSE ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_pkey_mgr_process_physical_port: ERR 0501: " + "No empty entry was found to insert %04x for node " + "0x%016" PRIx64 " and port %u\n", + cl_ntoh16(pkey), + 
cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + goto _done; + } - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " - "IB_DEFAULT_PKEY was inserted for node 0x%016" PRIx64 + status = osm_pkey_mgr_update_pkey_entry(p_mgr, p_physp, block, block_index); + + if (status != IB_SUCCESS) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_pkey_mgr_process_physical_port: " + "osm_pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " and port %u\n", + block_index, + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + goto _done; + } + + return_val = TRUE; /* pkey was inserted/updated */ + + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "%04x was inserted for node 0x%016" PRIx64 + " and port %u\n", + cl_ntoh16(pkey), + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(p_physp)); + } + + _done: + OSM_LOG_EXIT( p_mgr->p_log ); + return ( return_val ); +} + + +/********************************************************************** + **********************************************************************/ +static void osm_pkey_mgr_update_peer_port( + const osm_pkey_mgr_t * const p_mgr, + const osm_port_t * const p_port) +{ + osm_physp_t *p, *peer; + osm_node_t *p_node; + ib_pkey_table_t *block, *peer_block; + const osm_pkey_tbl_t *p_pkey_tbl, *p_peer_pkey_tbl; + uint16_t block_index; + uint16_t num_of_blocks; + ib_api_status_t status = IB_SUCCESS; + + p = osm_port_get_default_phys_ptr(p_port); + if (!osm_physp_is_valid(p)) + return; + peer = osm_physp_get_remote(p); + if (!peer || !osm_physp_is_valid(peer)) + return; + p_node = osm_physp_get_node_ptr(peer); + if (osm_node_get_type(p_node) == IB_NODE_TYPE_CA) + return; + + p_pkey_tbl = osm_physp_get_pkey_tbl(p); + 
p_peer_pkey_tbl = osm_physp_get_pkey_tbl(peer); + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_pkey_tbl); + if (num_of_blocks > osm_pkey_tbl_get_num_blocks(p_peer_pkey_tbl)) + num_of_blocks = osm_pkey_tbl_get_num_blocks(p_peer_pkey_tbl); + + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + peer_block = osm_pkey_tbl_block_get( p_peer_pkey_tbl, block_index ); + if (cl_memcmp(peer_block, block, sizeof(*block))) { + cl_memcpy(peer_block, block, sizeof(*block)); + status = osm_pkey_mgr_update_pkey_entry(p_mgr, peer, peer_block, block_index); + if (status != IB_SUCCESS) + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_pkey_mgr_update_peer_port: " + "osm_pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " and port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - port_num ); - } + block_index, + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(peer)); } } - else + + if ( num_of_blocks && status == IB_SUCCESS && + osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) { - /* default key or partial default key already exist */ - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_pkey_mgr_update_peer_port: " + "pkey table was updated for node 0x%016" PRIx64 + " and port %u\n", + cl_ntoh64(osm_node_get_node_guid(p_node)), + osm_physp_get_port_num(peer)); + } +} + +/********************************************************************** + **********************************************************************/ +static boolean_t osm_pkey_mgr_process_partition_table( + const osm_pkey_mgr_t * const p_mgr, + const osm_prtn_t * const p_prtn, + const boolean_t full) +{ + const cl_map_t * p_tbl = full ? 
+ &p_prtn->full_guid_tbl : &p_prtn->part_guid_tbl; + cl_map_iterator_t i, i_next; + ib_net16_t pkey = p_prtn->pkey; + osm_physp_t *p_physp; + boolean_t result = FALSE; + + if (full) + pkey = cl_hton16(cl_ntoh16(pkey)|0x8000); + + i_next = cl_map_head(p_tbl); + while (i_next != cl_map_end(p_tbl)) { + i = i_next; + i_next = cl_map_next(i); + p_physp = cl_map_obj(i); + if (p_physp && osm_physp_is_valid(p_physp) && + __osm_pkey_mgr_process_physical_port(p_mgr, pkey, p_physp)) { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "__osm_pkey_mgr_process_physical_port: " - "No need to insert IB_DEFAULT_PKEY for node 0x%016" PRIx64 - " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + result = TRUE; + if (osm_log_is_active(p_mgr->p_log, OSM_LOG_VERBOSE)) + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_pkey_mgr_process_partition_table: " + "Adding %04x for pkey table of node " + "0x%016" PRIx64 " port %u\n", + cl_ntoh16(pkey), + cl_ntoh64(osm_node_get_node_guid( + osm_physp_get_node_ptr(p_physp))), + osm_physp_get_port_num(p_physp)); } } - OSM_LOG_EXIT( p_mgr->p_log ); - return ( return_val ); + return result; } /********************************************************************** @@ -230,51 +364,54 @@ osm_signal_t osm_pkey_mgr_process( IN const osm_pkey_mgr_t * const p_mgr ) { - cl_qmap_t *p_node_guid_tbl; - osm_node_t *p_node; - osm_node_t *p_next_node; - uint8_t port_num; - osm_physp_t *p_physp; + cl_qmap_t *p_tbl; + cl_map_item_t *p_next; + osm_prtn_t *p_prtn; + osm_port_t *p_port; osm_signal_t result = OSM_SIGNAL_DONE; CL_ASSERT( p_mgr ); OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); - p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; + CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock); - CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + if (osm_prtn_make_partitions(p_mgr->p_log, p_mgr->p_subn) != IB_SUCCESS) + { + osm_log(p_mgr->p_log, OSM_LOG_ERROR, "osm_pkey_mgr_process: " + "osm_prtn_make_partitions() is failed.\n"); + goto _err; + } - p_next_node = ( 
osm_node_t * ) cl_qmap_head( p_node_guid_tbl ); - while ( p_next_node != ( osm_node_t * ) cl_qmap_end( p_node_guid_tbl ) ) + p_tbl = &p_mgr->p_subn->prtn_pkey_tbl; + + p_next = cl_qmap_head(p_tbl); + while (p_next != cl_qmap_end(p_tbl)) { - p_node = p_next_node; - p_next_node = ( osm_node_t * ) cl_qmap_next( &p_next_node->map_item ); + p_prtn = (osm_prtn_t *)p_next; + p_next = cl_qmap_next(p_next); - for ( port_num = 0; port_num < osm_node_get_num_physp( p_node ); - port_num++ ) - { - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - if ( osm_physp_is_valid( p_physp ) ) - { - if ( __osm_pkey_mgr_process_physical_port - ( p_mgr, p_node, port_num, p_physp ) ) - { - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - { - osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_pkey_mgr_process: " - "Adding IB_DEFAULT_PKEY for pkey table of node " - "0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - port_num ); - } - result = OSM_SIGNAL_DONE_PENDING; - } - } + if (osm_pkey_mgr_process_partition_table(p_mgr, p_prtn, FALSE)) + result = OSM_SIGNAL_DONE_PENDING; + if (osm_pkey_mgr_process_partition_table(p_mgr, p_prtn, TRUE)) + result = OSM_SIGNAL_DONE_PENDING; + } + + p_tbl = &p_mgr->p_subn->port_guid_tbl; + + p_next = cl_qmap_head(p_tbl); + while (p_next != cl_qmap_end(p_tbl)) + { + p_port = (osm_port_t *)p_next; + p_next = cl_qmap_next(p_next); + + if (osm_node_get_type(osm_port_get_parent_node(p_port)) != + IB_NODE_TYPE_SWITCH) { + osm_pkey_mgr_update_peer_port(p_mgr, p_port); } } + _err: CL_PLOCK_RELEASE( p_mgr->p_lock ); OSM_LOG_EXIT( p_mgr->p_log ); return ( result ); diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c new file mode 100644 index 0000000..f5f3a32 --- /dev/null +++ b/osm/opensm/osm_prtn.c @@ -0,0 +1,324 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +/* + * Abstract: + * Implementation of osm_prtn_t. + * This object represents an IBA partition. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + * $Revision$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include + +#include +#include +#include +#include +#include +#include + + +extern int osm_prtn_config_parse_file(osm_log_t * const p_log, + osm_subn_t * const p_subn, + const char *file_name); + + +static uint16_t global_pkey_counter; + +/* + * + */ + +osm_prtn_t* osm_prtn_new( + IN const char *name, + IN const uint16_t pkey ) +{ + osm_prtn_t *p = cl_zalloc(sizeof(*p)); + if (!p) + return NULL; + p->pkey = pkey; + cl_map_construct(&p->full_guid_tbl); + cl_map_init(&p->full_guid_tbl, 32); + cl_map_construct(&p->part_guid_tbl); + cl_map_init(&p->part_guid_tbl, 32); + + if (name && *name) + strncpy(p->name, name, sizeof(p->name)); + else + snprintf(p->name, sizeof(p->name), "%04x", cl_ntoh16(pkey)); + + return p; +} + +void osm_prtn_delete( + IN OUT osm_prtn_t** const pp_prtn ) +{ + osm_prtn_t *p = *pp_prtn; + cl_map_remove_all(&p->full_guid_tbl); + cl_map_destroy(&p->full_guid_tbl); + cl_map_remove_all(&p->part_guid_tbl); + cl_map_destroy(&p->part_guid_tbl); + cl_free(p); + *pp_prtn = NULL; +} + + +ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, osm_subn_t *p_subn, + osm_prtn_t *p, ib_net64_t guid, boolean_t full) +{ + cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; + ib_api_status_t status = IB_SUCCESS; + cl_map_t *p_tbl; + osm_port_t *p_port; + osm_physp_t *p_physp; + + p_port = (osm_port_t *)cl_qmap_get(p_port_tbl, guid); + if (!p_port || p_port == (osm_port_t *)cl_qmap_end(p_port_tbl)) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "port 0x%" PRIx64 " not found.\n", + cl_ntoh64(guid)); + return status; + } + + p_physp = osm_port_get_default_phys_ptr(p_port); + if (!p_physp) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + "no physical for port 0x%" PRIx64 "\n", + cl_ntoh64(guid)); + return status; + } + + if (osm_prtn_is_guid(p, guid)) { + osm_log(p_log, OSM_LOG_VERBOSE, "osm_prtn_add_port: " + 
"port 0x%" PRIx64 " already in " + "partition \'%s\' (%04x). Will overwrite.\n", + cl_ntoh64(guid), p->name, cl_ntoh16(p->pkey)); + } + + p_tbl = (full == TRUE) ? &p->full_guid_tbl : &p->part_guid_tbl ; + + if (cl_map_insert(p_tbl, guid, p_physp) == NULL) + return IB_INSUFFICIENT_MEMORY; + + return status; +} + + +ib_api_status_t osm_prtn_add_all(osm_log_t *p_log, osm_subn_t *p_subn, + osm_prtn_t *p, boolean_t full) +{ + cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; + cl_map_item_t *p_item; + osm_port_t *p_port; + ib_api_status_t status = IB_SUCCESS; + + p_item = cl_qmap_head(p_port_tbl); + while (p_item != cl_qmap_end(p_port_tbl)) { + p_port = (osm_port_t *)p_item; + p_item = cl_qmap_next(p_item); + status = osm_prtn_add_port(p_log, p_subn, p, + osm_port_get_guid(p_port), full); + if (status != IB_SUCCESS) + goto _err; + } + + _err: + return status; +} + + +ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p) +{ + ib_member_rec_t mc_rec; + ib_net64_t comp_mask; + ib_net16_t pkey; + osm_mgrp_t *p_mgrp = NULL; + osm_sa_t *p_sa = &p_subn->p_osm->sa; + ib_api_status_t status = IB_SUCCESS; + + pkey = cl_hton16(cl_ntoh16(p->pkey)|0x8000); + + cl_memclr(&mc_rec, sizeof(mc_rec)); + + mc_rec.mgid = osm_ipoib_mgid; /* this is ipv4 broadcast */ + cl_memcpy(&mc_rec.mgid.raw[4], &pkey, sizeof(pkey)); + + mc_rec.qkey = CL_HTON32(0x0b1b); + mc_rec.mtu = 4; /* 2048 Bytes */ + mc_rec.tclass = 0; + mc_rec.pkey = pkey; + mc_rec.rate = 0x3; /* 10Gb/sec */ + mc_rec.pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; + mc_rec.sl_flow_hop = OSM_DEFAULT_SL << 28; + /* Note: scope needs to be consistent with MGID */ + mc_rec.scope_state = 0x21; + + /* mtu and rate will be updated according to CA */ + comp_mask = 0; + + status = osm_mcmr_rcv_find_or_create_new_mgrp(&p_sa->mcmr_rcv, + comp_mask, &mc_rec, &p_mgrp); + + if (!p_mgrp || status != IB_SUCCESS) + osm_log( p_log, OSM_LOG_ERROR, + "osm_prtn_add_mcgroup:" + " failed to create mc group with %04x pkey\n", + 
cl_ntoh16(pkey)); + + return status; +} + + +static uint16_t __generate_pkey(osm_subn_t *p_subn) +{ + uint16_t pkey; + cl_qmap_t *m = &p_subn->prtn_pkey_tbl; + while ( global_pkey_counter < IB_DEFAULT_PARTIAL_PKEY - 1) { + pkey = ++global_pkey_counter; + if (cl_qmap_get(m, pkey) == cl_qmap_end(m)) + return cl_hton16(pkey); + } + return 0; +} + +osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, + const char *name, uint16_t pkey) +{ + osm_prtn_t *p = NULL, *p_check; + + if (pkey == 0 && !(pkey = __generate_pkey(p_subn))) + return NULL; + + if (cl_ntoh16(pkey)&0x8000) { + pkey = cl_hton16(cl_ntoh16(pkey)&~0x8000); + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_new: pkey was striped for" + " partition \'%s\' (%04x)\n", + name, cl_ntoh16(pkey)); + } + + p = osm_prtn_new(name, pkey); + if (!p) { + osm_log(p_log, OSM_LOG_ERROR, + "osm_prtn_make_new: Unable to create" + " partition \'%s\' (%04x)\n", + name, cl_ntoh16(pkey)); + return NULL; + } + + p_check = (osm_prtn_t *)cl_qmap_insert(&p_subn->prtn_pkey_tbl, + p->pkey, &p->map_item); + if (p != p_check) { + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_new: Duplicated partition" + " definition: \'%s\' (%04x) prev name \'%s\'" + " - will use it.\n", + name, cl_ntoh16(pkey), p_check->name); + osm_prtn_delete(&p); + p = p_check; + } + + return p; +} + + +static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, + osm_subn_t * const p_subn) +{ + ib_api_status_t status = IB_UNKNOWN_ERROR; + osm_prtn_t *p; + p = osm_prtn_make_new(p_log, p_subn, + "Default", IB_DEFAULT_PARTIAL_PKEY); + if (!p) + goto _err; + status = osm_prtn_add_all(p_log, p_subn, p, FALSE); + if (status != IB_SUCCESS) + goto _err;; + cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); + status = osm_prtn_add_port(p_log, p_subn, p, + p_subn->sm_port_guid, TRUE); + _err: + return status; +} + + +ib_api_status_t osm_prtn_make_partitions(osm_log_t * const p_log, + osm_subn_t * const p_subn) +{ + const char *file_name; + 
ib_api_status_t status = IB_SUCCESS; + osm_prtn_t *p, *p_next; + + file_name = p_subn->opt.partition_config_file ? + p_subn->opt.partition_config_file : + "/etc/osm-partitions.conf"; + + /* cl_qmap uses self addresses we cannot just save + qmap state and clean it later, so clean all now */ + p_next = (osm_prtn_t *)cl_qmap_head(&p_subn->prtn_pkey_tbl); + while (p_next != (osm_prtn_t *)cl_qmap_end(&p_subn->prtn_pkey_tbl)) { + p = p_next; + p_next = (osm_prtn_t *)cl_qmap_next(&p->map_item); + osm_prtn_delete(&p); + } + cl_qmap_init(&p_subn->prtn_pkey_tbl); + + global_pkey_counter = 0; + + status = osm_prtn_make_default(p_log, p_subn); + if(status != IB_SUCCESS) + goto _err; + + if (osm_prtn_config_parse_file(p_log, p_subn, file_name)) { + osm_log(p_log, OSM_LOG_VERBOSE, + "osm_prtn_make_partitions: Partition configuration " + "file \'%s\' was not fully processed (or does not exist).\n", + file_name); + } + + _err: + return status; +} diff --git a/osm/opensm/osm_prtn_config.c b/osm/opensm/osm_prtn_config.c new file mode 100644 index 0000000..97a835a --- /dev/null +++ b/osm/opensm/osm_prtn_config.c @@ -0,0 +1,387 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +/* + * Abstract: + * Implementation of opensm partition management configuration + * + * Environment: + * Linux User Mode + * + * $Revision$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include + +#include +#include +#include +#include + + +#if __WORDSIZE == 64 +#define STRTO_IB_NET64(str, end, base) strtoul(str, end, base) +#else +#define STRTO_IB_NET64(str, end, base) strtoull(str, end, base) +#endif + +#define PARSERR(log, lnum, fmt, arg...) { \ + osm_log(log, OSM_LOG_ERROR, \ + "\nPARSE ERROR: line %d: " fmt "\n", (lnum), ##arg ); \ + fprintf(stderr, \ + "\nPARSE ERROR: line %d: " fmt "\n", (lnum), ##arg ); \ +} + +#define PARSEWARN(log, lnum, fmt, arg...) 
\ + osm_log(log, OSM_LOG_VERBOSE, \ + "PARSE WARN: line %d: " fmt , (lnum), ##arg ) + +/* + */ + + +struct part_conf { + osm_log_t *p_log; + osm_subn_t *p_subn; + osm_prtn_t *p_prtn; +}; + + +extern osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, + const char *name, uint16_t pkey); +extern ib_api_status_t osm_prtn_add_all(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p, boolean_t full); +extern ib_api_status_t osm_prtn_add_port(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p, ib_net64_t guid, + boolean_t full); +extern ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, + osm_subn_t *p_subn, osm_prtn_t *p); + + +static int partition_create(unsigned lineno, struct part_conf *conf, + char *name, char *id, char *flag, char *flag_val) +{ + uint16_t pkey; + + if (!id && name && isdigit(*name)) { + id = name; + name = NULL; + } + + if (id) { + char *end; + pkey = strtoul(id, &end, 0); + if (end == id || *end) + return -1; + } + else + pkey = 0; + + conf->p_prtn = osm_prtn_make_new(conf->p_log, conf->p_subn, + name, cl_hton16(pkey)); + if (!conf->p_prtn) + return -1; + + if (flag) { + if(!strncmp(flag, "ipoib", strlen(flag))) + osm_prtn_add_mcgroup(conf->p_log, + conf->p_subn, conf->p_prtn); + else { + PARSEWARN(conf->p_log, lineno, + "unrecognized partition flag \'%s\'" + " - ignored.\n", flag); + } + } + + return 0; +} + + +static int partition_add_port(unsigned lineno, struct part_conf *conf, + char *name, char *flag) +{ + osm_prtn_t *p = conf->p_prtn; + ib_net64_t guid; + boolean_t full = FALSE; + + if (!name || !*name || !strncmp(name, "NONE", strlen(name))) + return 0; + + if (flag) { + if(!strncmp(flag, "full", strlen(flag))) + full = TRUE; + else if(strncmp(flag, "partial", strlen(flag))) { + PARSEWARN(conf->p_log, lineno, + "unrecognized port flag \'%s\' -" + " suppose \'partial\'\n", flag); + } + } + + if (!strncmp(name, "ALL", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + full) == IB_SUCCESS ? 
0 : -1; + } + else if (!strncmp(name, "SELF", strlen(name))) { + guid = cl_ntoh64(conf->p_subn->sm_port_guid); + } + else { + char *end; + guid = STRTO_IB_NET64(name, &end, 0); + if (!guid || *end) + return -1; + } + + if (osm_prtn_add_port(conf->p_log, conf->p_subn, p, + cl_hton64(guid), full) != IB_SUCCESS) + return -1; + + return 0; +} + + +/* conf file parser */ + +#define STRIP_HEAD_SPACES(p) while (*(p) == ' ' || *(p) == '\t' || \ + *(p) == '\n') { (p)++; } +#define STRIP_TAIL_SPACES(p) { char *q = (p) + strlen(p); \ + while ( q != (p) && ( *q == '\0' || \ + *q == ' ' || *q == '\t' || \ + *q == '\n')) { *q-- = '\0'; }; } + +static int parse_name_token(char *str, char **name, char **val) +{ + int len = 0; + char *p, *q; + + *name = *val = NULL; + + p = str; + + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + + q = strchr(p, '='); + if (q) + *q++ = '\0'; + + len = strlen(str) + 1; + str = q; + + q = p + strlen(p); + while ( q != p && + ( *q == '\0' || *q == ' ' || *q == '\t' || *q == '\n')) + *q-- = '\0'; + + *name = p; + + p = str; + if (!p) + return len; + + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + + q = p + strlen(p); + len += q - str + 1; + while ( q != p && + ( *q == '\0' || *q == ' ' || *q == '\t' || *q == '\n')) + *q-- = '\0'; + *val = p; + + return len; +} + + +static struct part_conf *new_part_conf(osm_log_t *p_log, osm_subn_t *p_subn) +{ + static struct part_conf part; + struct part_conf *conf = &part; + memset(conf, 0, sizeof(*conf)); + conf->p_log = p_log; + conf->p_subn = p_subn; + conf->p_prtn = NULL; + return conf; +} + +static int flush_part_conf(struct part_conf *conf) +{ + memset(conf, 0, sizeof(*conf)); + return 0; +} + + +static int parse_part_conf(struct part_conf *conf, char *str, int lineno) +{ + int ret, len = 0; + char *name, *id, *flag, *flval; + char *q, *p; + + p = str; + if (*p == '\t' || *p == '\0' || *p == '\n') + p++; + + len += p - str; + str = p; + + if (conf->p_prtn) + goto skip_header; + + q = strchr(p, ':');
+ if (!q) { + PARSERR(conf->p_log, lineno, + "no partition definition found\n"); + return -1; + } + + *q++ = '\0'; + str = q; + + name = id = flag = flval = NULL; + + q = strchr(p, ','); + if (q) + *q = '\0'; + + ret = parse_name_token(p, &name, &id); + p += ret; + len += ret; + + if (q) { + ret = parse_name_token(p, &flag, &flval); + if (!flag) { + PARSERR(conf->p_log, lineno, + "bad partition flags\n"); + return -1; + } + p += ret; + len += ret; + } + + if (p != str || (partition_create(lineno, conf, + name, id, flag, flval) < 0)) { + PARSERR(conf->p_log, lineno, + "bad partition definition\n"); + return -1; + } + + skip_header: + do { + name = flag = NULL; + q = strchr(p, ','); + if (q) + *q++ = '\0'; + ret = parse_name_token(p, &name, &flag); + if (partition_add_port(lineno, conf, name, flag) < 0) { + PARSERR(conf->p_log, lineno, + "bad PortGUID\n"); + return -1; + } + p += ret; + len += ret; + } while (q); + + return len; +} + +int osm_prtn_config_parse_file(osm_log_t *p_log, osm_subn_t *p_subn, + const char *file_name) +{ + char line[1024]; + struct part_conf *conf = NULL; + FILE *file; + int lineno; + + file = fopen(file_name, "r"); + if (!file) { + perror("fopen"); + return -1; + } + + lineno = 0; + + while (fgets(line, sizeof(line) - 1, file) != NULL) { + char *q, *p = line; + + lineno++; + + p = line; + + q = strchr(p, '#'); + if (q) + *q = '\0'; + + do { + int len; + while (*p == ' ' || *p == '\t' || *p == '\n') + p++; + if (*p == '\0') + break; + + if (!conf && + !(conf = new_part_conf(p_log, p_subn))) { + PARSERR(p_log, lineno, + "internal: cannot create config.\n"); + break; + } + + q = strchr(p, ';'); + if (q) + *q = '\0'; + + len = parse_part_conf(conf, p, lineno); + if (len < 0) { + break; + } + + p += len; + + if (q) { + flush_part_conf(conf); + conf = NULL; + } + } while (q); + } + + fclose(file); + + return 0; +} diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index ef63c15..22fe7dc 100644 --- 
a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -1146,6 +1146,24 @@ __mgrp_request_is_realizable( } /********************************************************************** + Call this function to find or create a new mgrp. +**********************************************************************/ +ib_api_status_t +osm_mcmr_rcv_find_or_create_new_mgrp( + IN osm_mcmr_recv_t* const p_rcv, + IN ib_net64_t comp_mask, + IN ib_member_rec_t* const p_recvd_mcmember_rec, + OUT osm_mgrp_t **pp_mgrp) +{ + ib_api_status_t status; + status = __get_mgrp_by_mgid(p_rcv, p_recvd_mcmember_rec, pp_mgrp); + if (status == IB_SUCCESS) + return status; + return osm_mcmr_rcv_create_new_mgrp(p_rcv, comp_mask, + p_recvd_mcmember_rec, pp_mgrp); +} + +/********************************************************************** Call this function to create a new mgrp. **********************************************************************/ ib_api_status_t diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 1340017..1068435 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -53,11 +53,13 @@ #include #include +#include #include #include #include #include #include +#include #include #include #include @@ -97,6 +99,7 @@ osm_subn_destroy( osm_port_t *p_port, *p_next_port; osm_switch_t *p_sw, *p_next_sw; osm_remote_sm_t *p_rsm, *p_next_rsm; + osm_prtn_t *p_prtn, *p_next_prtn; osm_mgrp_t *p_mgrp, *p_next_mgrp; osm_infr_t *p_infr, *p_next_infr; @@ -135,6 +138,14 @@ osm_subn_destroy( cl_free( p_rsm ); } + p_next_prtn = (osm_prtn_t*)cl_qmap_head( &p_subn->prtn_pkey_tbl ); + while( p_next_prtn != (osm_prtn_t*)cl_qmap_end( &p_subn->prtn_pkey_tbl ) ) + { + p_prtn = p_next_prtn; + p_next_prtn = (osm_prtn_t*)cl_qmap_next( &p_prtn->map_item ); + osm_prtn_delete( &p_prtn ); + } + p_next_mgrp = (osm_mgrp_t*)cl_qmap_head( &p_subn->mgrp_mlid_tbl ); while( p_next_mgrp != (osm_mgrp_t*)cl_qmap_end( &p_subn->mgrp_mlid_tbl ) ) { @@ -167,10 +178,13 @@ osm_subn_destroy( ib_api_status_t osm_subn_init( IN osm_subn_t* const p_subn, + IN osm_opensm_t * const p_osm, IN const osm_subn_opt_t* const p_opt ) { cl_status_t status; + p_subn->p_osm = p_osm; + status = cl_ptr_vector_init( &p_subn->node_lid_tbl, OSM_SUBNET_VECTOR_MIN_SIZE, OSM_SUBNET_VECTOR_GROW_SIZE ); @@ -428,6 +442,7 @@ osm_subn_set_default_opt( p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->partition_config_file = OSM_DEFAULT_PARTITION_CONFIG_FILE; p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; p_opt->max_port_profile = 0xffffffff; _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mdidomenico at silverstorm.com Tue Feb 21 06:31:57 2006 From: mdidomenico at silverstorm.com (DiDomenico, Mike) Date: Tue, 21 Feb 2006 09:31:57 -0500 Subject: [openib-general] SRP not detecting Message-ID: -----Original Message----- > I suggest using ibsrpdm from my 
srptools stuff under userspace in > svn. It's a lot easier to use, since you don't have to guess the LID > of the target, and it tells you everything you need to do. > > You can just do "ibsrpdm" to get a summary of all the targets on the > fabric, and "ibsrpdm -c" to get a string you can echo to the kernel to > connect. > > > This is almost definitely the node GUID, so using it as the port GUID > will be wrong. Roland, Without a doubt your tool is much easier to use than DMCLI, someone should probably update the wiki... Which is where I got the info to use dmcli... I'm still getting rejected however, and I suspect it's because I haven't created an SRP map on the SilverStorm 7000, which leads me back to the question of why it's not auto-detecting in the switch. Which very well could be a switch issue, I have to find an engineering resource over here to see further. I'm going to add the map manually and see if I can get the system at least going. thanks From halr at voltaire.com Tue Feb 21 06:42:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 09:42:10 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> Message-ID: <1140532783.28051.5457.camel@hal.voltaire.com> Hi Fab, On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > On 2/20/06, Roland Dreier wrote: > > Fabian> What is the behavior of SMs that pre-create the group in > > Fabian> response to a GET query for the MC group parameters? Does > > Fabian> the query return a record, or does it fail with no > > Fabian> records? > > > > I guess it depends on the SM. I'm not sure about this. If the group is precreated, it exists and I think a record needs to be returned.
> I guess the follow up question is how an SM handles a MC join that > specifies the QKey and other settings that may conflict with the > preset ones, but didn't return a response to the GET query. Does it > fail the join, does it succeed it with the provided values, or does it > succeed it and return the preset parameters? If the join is to the same MC group and the settings conflict, it should fail. > OpenSM seems to respond to the GET query, even if there are no members > to the group In IBA, there are no groups without members so a precreated group has a member (albeit an admin member). > - and the query returns a group that specifies a rate of > 10Gbps (4X SDR - same as the system running OpenSM, incidentally) Note also that this will be changing with the partition manager work going on. > > Do you know of an SM that has problems with the existing Linux IPoIB driver? > > No, actually my query wasn't driven by an actual issue with the Linux > IPoIB driver. I was trying to figure out how to do better error > handling and diagnostic logging in the Windows IPoIB and wanted to see > how the Linux IPoIB driver handled a similar situation. > > The problem I was having was that the sequence of events I was using > in the Windows IPoIB resulted in ambiguous error conditions, where it > wasn't possible to differentiate between unexpected errors and errors > that could be worked around. > > The Windows IPoIB follows a sequence like: > > if( GET broadcast group == NO_ERROR ) > if( SET join broadcast group != NO_ERROR ) repeat GET; > else > if( SET create broadcast group != NO_ERROR ) repeat GET; > > Specifically, the problem relates to handling a 1X node trying to join > the broadcast group, and what the retry policy should be. If the > group already exists at 4X, the join should fail if the SM follows the > compliance statements in the IB spec. I think you would need to look at the returned error status. 
A mismatched rate join should fail with unrealizable rather than invalid. Does that help? > Because the code allowed for > the broadcast group not pre-existing (that is, a join could fail > because the group wasn't created), it was unclear whether a failure of > the join indicated that there was a setting incompatibility (1X vs. > 4X), or just whether the group needed to be created. Then, because > the code handled the race where some other node beat it to creation > and thus resulted in invalid settings, a failure in creation resulted > in a retry of the whole process, starting with a new GET query. > > A 1X node in such a case ends up perpetually retrying the sequence of > events, even though it really should just stop and wait for the next > port up event (since link width changes require the port to go through > the down state as far as I understand). > > The lack of detailed error reporting in SA queries could stand to be > improved, and something as simple as the SA returning a component mask > indicating which components caused conflicts would be extremely useful > in determining the next course of action. ERR_REQ_INVALID is just too > broad in this case to allow the code to do anything intelligent. I think ERR_REQ_UNREALIZABLE helps here. > As a note, OpenSM seems to allow a 1X node to join a 4X multicast > group which it should not, unless the join specifies the rate in which > case the join fails as expected. I believe this is a bug which should be fixed. > Do we just not care that a 1X node > could be dropping 3/4 of the packets sent on the broadcast group, > aside from OpenSM violating o15-0.1.13? Note that the failure if the > rate is specified occurs even if the 1X node is the first to attempt > to join (that is, no other nodes on the fabric have IPoIB running). It is a preconfiguration issue in terms of OpenSM. It will be allowing different rates in the near term future.
> Anyhow, I'm still not sure how to cleanly handle these errors so that > the system log is pretty clear that things are not working likely due > to a bad cable. Does the above help? -- Hal > - Fab > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mdidomenico at silverstorm.com Tue Feb 21 06:57:08 2006 From: mdidomenico at silverstorm.com (DiDomenico, Mike) Date: Tue, 21 Feb 2006 09:57:08 -0500 Subject: [openib-general] SRP not detecting Message-ID: > > You can just do "ibsrpdm" to get a summary of all the targets on the > > fabric, and "ibsrpdm -c" to get a string you can echo to the kernel to > > connect. > > > > > > This is almost definitely the node GUID, so using it as the port GUID > > will be wrong. > > I'm still getting rejected however, and I suspect it's because I haven't > created an SRP map on the SilverStorm 7000, which leads me back to the > question of why it's not auto-detecting in the switch. Which very well > could be a switch issue, I have to find an engineering resource over > here to see further. > > I'm going to add the map manually and see if I can get the system at > least going. Interestingly enough, after I ran the ibsrpdm command, echoed the output to add_target, and got the reject message in /var/log/messages, I re-ran the auto-detect and the switch found the host, but the GUID and extension ID were backwards. I'm not sure if this is a switch bug or an openib bug. I'm going to look into it.
From halr at voltaire.com Tue Feb 21 06:56:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 09:56:45 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> Message-ID: <1140533804.28051.5566.camel@hal.voltaire.com> On Mon, 2006-02-20 at 21:14, Fabian Tillier wrote: [snip...] > Shouldn't IPoIB first do a GET for the broadcast group, and use those > settings if it exists, otherwise create it? That's one possible algorithm but not the only one. -- Hal > Thanks, > > - Fab > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Feb 21 07:00:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 10:00:21 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> Message-ID: <1140534007.28051.5596.camel@hal.voltaire.com> On Mon, 2006-02-20 at 21:53, Fabian Tillier wrote: > On 2/20/06, Roland Dreier wrote: > > Fabian> Second, if so, how is IPoIB supposed to interact with > > Fabian> subnet managers that don't pre-create an empty broadcast > > Fabian> group? > > > > Fabian> Shouldn't IPoIB first do a GET for the broadcast group, > > Fabian> and use those settings if it exists, otherwise create it? > > > > What parameters should it use to create it? > > The only parameter that can be problematic is the QKey, but it's not a > problem for it to just make one up, as long as it's a privileged one.
> All other parameters can be taken from the local port info. There are other parameters aside from QKey which cannot be derived from local port info. > > The IETF drafts for IPoIB say that the IPv4 broadcast group must be > > created administratively before an IPoIB interface can be brought up. > > Doesn't the IB spec require that a multicast group have a member? Yes. > That is, when the last member leaves the group, the multicast group is > destroyed? No; destruction can be lazy (see p.914-5 o15-0.1.14). In fact, so lazy that it technically does not have to be done. > Further, the IETF drafts for IPoIB only recommend > administrative creation of the broadcast group, but allow creation by > the first member. Yes, but if it's precreated it can't be created by the first member. The issue with the first member creation is how do you know which first member has the right set of parameters? I believe most if not all have chosen the precreated approach (for that reason). > An IB MC join of a nonexistent group should fail unless all the > proper parameters are provided to create the group. Correct. > What is the behavior of SMs that pre-create the group in response to a > GET query for the MC group parameters? Does the query return a > record, or does it fail with no records? As an IBA group does not exist without members (even an admin member is a member), I would think that they return the group info.
-- Hal > > - Fab > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Tue Feb 21 07:37:19 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 21 Feb 2006 17:37:19 +0200 Subject: [openib-general] [RFC] [PATCH] OpenSM: Add functional partition manager support In-Reply-To: <20060219163843.GB16012@sashak.voltaire.com> References: <20060219163843.GB16012@sashak.voltaire.com> Message-ID: <20060221153719.GA9041@sashak.voltaire.com> Hello Yael, (sorry about breaking the thread - I just deleted the original message by mistake). On ... Tue 21 Feb , Yael Kalka wrote: > > There are problems with the patch you sent - doesn't work when trying to > apply it. > One problem, for example - is that file opensm/osm_partition.h is given > with diffs, when actually it is a new file. Yes, and this should work. Just save the original message in a text file (mbox or whatever) and then: $ cd /to/your/osm $ patch -p2 < /path/to/saved_file Sasha.
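A note on the -p2 above, as a rough sketch rather than a description of how patch(1) is implemented: the file names in this diff look like a/osm/opensm/osm_prtn_config.c, so when patch is run from inside the osm directory, two leading path components (a/ and osm/) have to be dropped before the names resolve. The helper name strip_components() below is made up for illustration, and returning NULL when there are too few components is a choice of this sketch, not necessarily what patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer into 'path' just past the first
 * 'n' slash-separated components, roughly what "patch -pN" does to the
 * file names found in a diff. Runs of adjacent slashes count as one.
 * Returns NULL if the path has fewer than 'n' slashes (sketch's choice).
 */
static const char *strip_components(const char *path, int n)
{
	const char *p = path;

	assert(path != NULL && n >= 0);

	while (n > 0) {
		const char *slash = strchr(p, '/');

		if (slash == NULL)
			return NULL;	/* not enough components to strip */
		p = slash + 1;
		while (*p == '/')	/* collapse repeated slashes */
			p++;
		n--;
	}
	return p;
}
```

With n = 2, a/osm/opensm/osm_prtn_config.c becomes opensm/osm_prtn_config.c, which is the path relative to the osm directory that the cd in the instructions puts you in.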
From ftillier at silverstorm.com Tue Feb 21 08:15:13 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 21 Feb 2006 08:15:13 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <1140532783.28051.5457.camel@hal.voltaire.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> Message-ID: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> On 21 Feb 2006 09:42:10 -0500, Hal Rosenstock wrote: > On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > > The lack of detailed error reporting in SA queries could stand to be > > improved, and something as simple as the SA returning a component mask > > indicating which components caused conflicts would be extremely useful > > in determining the next course of action. ERR_REQ_INVALID is just too > > broad in this case to allow the code to do anything intelligent. > > I think ERR_REQ_UNREALIZABLE helps here. What is this ERR_REQ_UNREALIZABLE that you speak of? Where is it documented? I only see ERR_REQ_INVALID in the 1.2 spec, perhaps I need to look at an errata of sorts? 
Thanks, - Fab From halr at voltaire.com Tue Feb 21 08:23:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 11:23:45 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> Message-ID: <1140539024.28051.6108.camel@hal.voltaire.com> Hi Fab, On Tue, 2006-02-21 at 11:15, Fabian Tillier wrote: > On 21 Feb 2006 09:42:10 -0500, Hal Rosenstock wrote: > > On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > > > The lack of detailed error reporting in SA queries could stand to be > > > improved, and something as simple as the SA returning a component mask > > > indicating which components caused conflicts would be extremely useful > > > in determining the next course of action. Or at least 1 component mask bit of a conflict... Interesting idea :-) > ERR_REQ_INVALID is just too > > > broad in this case to allow the code to do anything intelligent. > > > > I think ERR_REQ_UNREALIZABLE helps here. > > What is this ERR_REQ_UNREALIZABLE that you speak of? Where is it > documented? I only see ERR_REQ_INVALID in the 1.2 spec, perhaps I > need to look at an errata of sorts? You're right; Unrealizable == ERR_REQ_INVALID so there is no easy way to tell the difference. 
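To make the ambiguity concrete, here is a minimal sketch of the GET / join / create sequence described earlier in this thread, run against a toy SA. Everything in it is an assumption for illustration: fake_sa, the status names, and the use of link width as a stand-in for rate are invented, the toy SA enforces the rate check strictly, and none of this is the Windows or Linux IPoIB code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; a real SA reports far more detail. */
enum sa_status { SA_OK, SA_ERR_NO_RECORDS, SA_ERR_REQ_INVALID };

/* Toy SA state: one broadcast group and the joining port's link width. */
struct fake_sa {
	int group_exists;
	int group_width;	/* e.g. 4 for a 4X group */
	int port_width;		/* e.g. 1 for a 1X port */
};

static enum sa_status sa_get(const struct fake_sa *sa)
{
	return sa->group_exists ? SA_OK : SA_ERR_NO_RECORDS;
}

static enum sa_status sa_join(const struct fake_sa *sa)
{
	if (!sa->group_exists)
		return SA_ERR_REQ_INVALID;
	/* A port slower than the group cannot join; same status as above. */
	return sa->port_width >= sa->group_width ? SA_OK : SA_ERR_REQ_INVALID;
}

static enum sa_status sa_create(struct fake_sa *sa)
{
	if (sa->group_exists)
		return SA_ERR_REQ_INVALID;	/* someone else created it first */
	sa->group_exists = 1;
	sa->group_width = sa->port_width;
	return SA_OK;
}

/*
 * GET, then join if the group exists, else create. Any failure restarts
 * from GET. Returns the attempt number that succeeded, or 0 if the
 * caller's retry budget ran out.
 */
static int bcast_join(struct fake_sa *sa, int max_attempts)
{
	int attempt;

	assert(sa != NULL && max_attempts > 0);

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		enum sa_status st = (sa_get(sa) == SA_OK) ?
			sa_join(sa) : sa_create(sa);

		if (st == SA_OK)
			return attempt;
		/* SA_ERR_REQ_INVALID: lost a race or rate mismatch? Unknown. */
	}
	return 0;
}

/* Convenience wrapper so a scenario can be run in one call. */
static int simulate(int group_exists, int group_width, int port_width,
		    int max_attempts)
{
	struct fake_sa sa = { group_exists, group_width, port_width };

	return bcast_join(&sa, max_attempts);
}
```

In this toy model a 4X port joining an existing 4X group, or any port creating the group, succeeds on the first pass, while a 1X port facing a 4X group burns its whole retry budget because the only status it ever sees is the same one a lost creation race would produce.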
-- Hal > Thanks, > > - Fab References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> Message-ID: <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> On 21 Feb 2006 11:23:45 -0500, Hal Rosenstock wrote: > Hi Fab, > > On Tue, 2006-02-21 at 11:15, Fabian Tillier wrote: > > On 21 Feb 2006 09:42:10 -0500, Hal Rosenstock wrote: > > > On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > > > > The lack of detailed error reporting in SA queries could stand to be > > > > improved, and something as simple as the SA returning a component mask > > > > indicating which components caused conflicts would be extremely useful > > > > in determining the next course of action. > > Or at least 1 component mask bit of a conflict... > > Interesting idea :-) If I were to write something up as an enhancement to the spec, which working group would it be submitted to, and do you want to help? I don't know if SMs today clear the component mask in responses or just leave it as the component mask of the request, but it would be great to find a way that is backward compatible.
- Fab From halr at voltaire.com Tue Feb 21 09:22:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 12:22:34 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> Message-ID: <1140542552.28051.6480.camel@hal.voltaire.com> On Tue, 2006-02-21 at 12:08, Fabian Tillier wrote: > On 21 Feb 2006 11:23:45 -0500, Hal Rosenstock wrote: > > Hi Fab, > > > > On Tue, 2006-02-21 at 11:15, Fabian Tillier wrote: > > > On 21 Feb 2006 09:42:10 -0500, Hal Rosenstock wrote: > > > > On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > > > > > The lack of detailed error reporting in SA queries could stand to be > > > > > improved, and something as simple as the SA returning a component mask > > > > > indicating which components caused conflicts would be extremely useful > > > > > in determining the next course of action. > > > > Or at least 1 component mask bit of a conflict... > > > > Interesting idea :-) > > If I were to write something up as an enhancement to the spec, which > working group would it be submitted to, MgtWG > and do you want to help? Sure. I'd be happy to help improve the spec :-) > I don't know if SMs today clear the component mask in responses or > just leave it as the component mask of the request, but it would be > great to find a way that is backward compatible. I think I see where you are headed with this. 
The CM header description for SA header only indicates SA operations and does not appear to distinguish between queries and responses and most importantly errors. There is some additional informative text on p.925 on its use during read, set, or delete operations but not event forwarding. Based on that, I suspect it technically is not backward compatible as there is no statement on CM in error cases. -- Hal > - Fab From rdreier at cisco.com Tue Feb 21 09:25:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Feb 2006 09:25:38 -0800 Subject: [openib-general] SRP not detecting In-Reply-To: (Mike DiDomenico's message of "Tue, 21 Feb 2006 09:31:57 -0500") References: Message-ID: Mike> Without a doubt your tool is much easier to use than DMCLI, Mike> someone should probably update the wiki... Which is where I Mike> got the info to use dmcli... Go ahead and edit the page. All you have to do is create a wiki account. - R. From viswa.krish at gmail.com Tue Feb 21 09:54:50 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 21 Feb 2006 09:54:50 -0800 Subject: [openib-general] mthca and coalesced ACK Message-ID: <4df28be40602210954y44eff456mcdc20e21d2bfd7e1@mail.gmail.com> When the HCA receives a back-to-back RDMA write followed by RDMA read requests, it generates a coalesced ACK (an implicit ACK for the RDMA write). Is there a configuration option in the mthca driver that will make the HCA firmware generate individual ACKs? I am trying to debug another issue and this would be helpful. Thanks, Viswa -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mshefty at ichips.intel.com Tue Feb 21 10:01:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 10:01:21 -0800 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4BCC@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4BCC@mtlexch01.mtl.com> Message-ID: <43FB5571.1060000@ichips.intel.com> > I did it this way to avoid using kzalloc for all segment allocations, or > kmalloc/memset for all. This way, only the last segment is "memset"'ed > and only in the padding area. The API states that the buffer returned from ib_create_send_mad will be cleared. We either need to clear the buffer, or update the documentation. > [JPM]: This was done to hide segment implementation details from the > user_mad layer. You are correct that we can get rid of the size field, > and store it only once. > However, I think that the proper place for this value is the > "ib_mad_send_wr_private" structure, which also holds the list pointer. > How about if we change > struct ib_mad_multipacket_seg > *ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int > seg_num) > > to also return the segment size: Since the user needs to know the segment size, it makes more sense to me to just expose it through ib_mad_send_buf. > [JPM]: You are correct. In fact, the test just before this one is also > performed in ib_create_send_mad(), and may also be deleted (this was not > part of the patch): > > if (rmpp_active && !agent->rmpp_version) { > ret = -EINVAL; > goto err_ah; > } I will remove these, and submit a patch for any other changes. Thanks. 
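For reference, a minimal sketch of the "clear only the padding of the last segment" approach described in the quoted text above. The helper names and the plain byte-buffer segment are assumptions for illustration; this is not the ib_mad code, and as noted in the reply, the ib_create_send_mad documentation promises a fully cleared buffer, which this shortcut alone does not provide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: number of unused bytes at the end of the last
 * fixed-size segment when 'data_len' bytes are split into segments of
 * 'seg_size' bytes each.
 */
static size_t last_seg_pad(size_t data_len, size_t seg_size)
{
	size_t rem;

	assert(seg_size > 0);

	rem = data_len % seg_size;
	return rem ? seg_size - rem : 0;
}

/*
 * Zero only the padding area of the last segment, leaving the payload
 * bytes (and every earlier segment) untouched.
 */
static void clear_last_seg_pad(unsigned char *last_seg, size_t data_len,
			       size_t seg_size)
{
	size_t pad = last_seg_pad(data_len, seg_size);

	memset(last_seg + (seg_size - pad), 0, pad);
}
```

The trade-off is the one raised in the thread: this avoids a memset of every segment, but any segment bytes the caller never writes keep whatever the allocator left there, so it only matches the documented API if the caller always fills the full payload.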
- Sean From tbarron at ornl.gov Tue Feb 21 10:19:02 2006 From: tbarron at ornl.gov (Tom Barron) Date: Tue, 21 Feb 2006 13:19:02 -0500 Subject: [openib-general] SRP not detecting In-Reply-To: References: Message-ID: <1887bb5cc994879439a27a91f703e447@ornl.gov> On 2006 Feb 21 , at 12:25, Roland Dreier wrote: > Mike> Without a doubt your tool is much easier to use then DMCLI, > Mike> someone should probably update the wiki... Which is where I > Mike> got the info to use dmcli... > > Go ahead and edit the page. All you have to do is create a wiki > account. FWIW, I already did edit at least one reference to dmcli in the wiki and added a mention of ibsrpdm, with a link to Mike's message stating that ibsrpdm is easier. I don't know whether I got the reference Mike originally saw. Tom From halr at voltaire.com Tue Feb 21 10:45:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 13:45:03 -0500 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060219131242.GE22037@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> Message-ID: <1140547502.28051.7015.camel@hal.voltaire.com> On Sun, 2006-02-19 at 08:12, Michael S. Tsirkin wrote: > > > The SWG defined a generic mechanism which uses REJ to indicate that the > > > passive side does not accept a certain REQ fields, and allows the passive > > > side to indicate an alternative value. Indirection is also supported through > > > the same protocol. It also allows the active side, following the REJ, to use > > > an alternate value, other than the one suggested by the passive side, i.e. > > > passive side only has a veto capability. This is the mechanism and the short > > > theory behind it. Unfortunately it's a bit inefficient in terms of > > > performance because of the ping pong of messages. 
Solving just the MTU might > > > not be a good enough argument. The approach should be to enable the active > > > side to specify a set of acceptable parameters for each one of the REQ > > > fields, and then let the passive side to choose. This may change the CM > > > packets all over and will introduce new problems. I don't think that there's > > > a good chance of just adding a solution for just one of the fields. Anyway, > > > you can still try and propose this to IBTA, I tried it once already :) > > > > Thanks for the historical perspective. It's harder to overturn an > > existing vote on something at the IBTA. Not sure I have the time to take > > up this (larger) mission. > > Assuming the spec says as it is, then: > 1. CMA needs to be modified to retry the connection if its rejected because > of lower MTU. > 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec > says the connection should be closed when we get a REJ. Can you be specific about the spec citations for SDP and SRP for REJ handling ? Isn't it more the retry strategy once the connection is REJected ? Is that in those specs ? -- Hal From mst at mellanox.co.il Tue Feb 21 13:09:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Feb 2006 23:09:56 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <1140547502.28051.7015.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> Message-ID: <20060221210956.GC21543@mellanox.co.il> Quoting Hal Rosenstock : > > Assuming the spec says as it is, then: > > 1. CMA needs to be modified to retry the connection if its rejected because > > of lower MTU. > > 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec > > says the connection should be closed when we get a REJ. 
> > Can you be specific about the spec citations for SDP and SRP for REJ > handling ? Isn't it more the retry strategy once the connection is > REJected ? Is that in those specs ? This is not explicitly explained in spec. I think Dror discussed the use of REJ/retry to get the MTU in his mail in sufficient detail. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 21 13:15:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 16:15:34 -0500 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221210956.GC21543@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> Message-ID: <1140556533.28051.7947.camel@hal.voltaire.com> On Tue, 2006-02-21 at 16:09, Michael S. Tsirkin wrote: > Quoting Hal Rosenstock : > > > Assuming the spec says as it is, then: > > > 1. CMA needs to be modified to retry the connection if its rejected because > > > of lower MTU. > > > 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec > > > says the connection should be closed when we get a REJ. > > > > Can you be specific about the spec citations for SDP and SRP for REJ > > handling ? Isn't it more the retry strategy once the connection is > > REJected ? Is that in those specs ? > > This is not explicitly explained in spec. I think Dror discussed the use of > REJ/retry to get the MTU in his mail in sufficient detail. Sorry for being dense but this is what Dror wrote: "The SWG defined a generic mechanism which uses REJ to indicate that the passive side does not accept a certain REQ fields, and allows the passive side to indicate an alternative value. Indirection is also supported through the same protocol. 
It also allows the active side, following the REJ, to use an alternate value, other than the one suggested by the passive side, i.e. passive side only has a veto capability." So the only issue here is the inefficiency in terms of the back and forth of CM messages to get to the 1K MTU connection. How important is connection rate for SDP and SRP ? If not, can't we live with how things are ? -- Hal From mst at mellanox.co.il Tue Feb 21 13:36:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Feb 2006 23:36:10 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <1140556533.28051.7947.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> Message-ID: <20060221213610.GD21543@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter > > On Tue, 2006-02-21 at 16:09, Michael S. Tsirkin wrote: > > Quoting Hal Rosenstock : > > > > Assuming the spec says as it is, then: > > > > 1. CMA needs to be modified to retry the connection if its rejected because > > > > of lower MTU. > > > > 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec > > > > says the connection should be closed when we get a REJ. > > > > > > Can you be specific about the spec citations for SDP and SRP for REJ > > > handling ? Isn't it more the retry strategy once the connection is > > > REJected ? Is that in those specs ? > > > > This is not explicitly explained in spec. I think Dror discussed the use of > > REJ/retry to get the MTU in his mail in sufficient detail. 
> > Sorry for being dense but this is what Dror wrote: > "The SWG defined a generic mechanism which uses REJ to indicate that > the passive side does not accept a certain REQ fields, and allows the > passive side to indicate an alternative value. Indirection is also > supported through the same protocol. It also allows the active side, > following the REJ, to use an alternate value, other than the one > suggested by the passive side, i.e. passive side only has a veto > capability." I think problems that need resolution are: 1. Some of places in spec require the connection to be terminated after REJ. It does not seem to describe what Dror says anywhere. For example, see SDP spec: A4.5.1.2 ABORTING CONNECTION SETUP CA4-43: When a CM REJ MAD is received by either the connecting or accepting peer the connection setup shall be aborted. If a CM REJ MAD is sent for an SDP-specific error, the reject reason code value shall be 28 (Consumer Reject -- 12.6.7.2 Rejection Reason on page 665. An SDP implementation is expected to cleanup any resources associated with an aborted connection. 2. The implementation is still missing: does it belong as part of CM, or should it be a higher level thing like CMA? 3. Is there some solution for backward compatibility? There does not seem to exist a way to figure out whether sending REJ makes sense since the remote side will retry the connection with a better MTU value, or not. > So the only issue here is the inefficiency in terms of the back and > forth of CM messages to get to the 1K MTU connection. How important is > connection rate for SDP and SRP ? If not, can't we live with how things > are ? SDP implements sockets, and thats a very wide field, so everything is important. AFAIK, connection rate is very important for some socket applications, and does not matter at all for others. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 21 13:43:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 16:43:32 -0500 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221213610.GD21543@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> Message-ID: <1140558211.28051.8131.camel@hal.voltaire.com> On Tue, 2006-02-21 at 16:36, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter > > > > On Tue, 2006-02-21 at 16:09, Michael S. Tsirkin wrote: > > > Quoting Hal Rosenstock : > > > > > Assuming the spec says as it is, then: > > > > > 1. CMA needs to be modified to retry the connection if its rejected because > > > > > of lower MTU. > > > > > 2. SDP/SRP protocols specs need a clarification: e.g. current SDP spec > > > > > says the connection should be closed when we get a REJ. > > > > > > > > Can you be specific about the spec citations for SDP and SRP for REJ > > > > handling ? Isn't it more the retry strategy once the connection is > > > > REJected ? Is that in those specs ? > > > > > > This is not explicitly explained in spec. I think Dror discussed the use of > > > REJ/retry to get the MTU in his mail in sufficient detail. > > > > Sorry for being dense but this is what Dror wrote: > > "The SWG defined a generic mechanism which uses REJ to indicate that > > the passive side does not accept a certain REQ fields, and allows the > > passive side to indicate an alternative value. Indirection is also > > supported through the same protocol. 
It also allows the active side, > > following the REJ, to use an alternate value, other than the one > > suggested by the passive side, i.e. passive side only has a veto > > capability." > > I think problems that need resolution are: > 1. Some of places in spec require the connection to be terminated after REJ. > It does not seem to describe what Dror says anywhere. > > For example, see SDP spec: > > A4.5.1.2 ABORTING CONNECTION SETUP > CA4-43: When a CM REJ MAD is received by either the connecting or accepting > peer the connection setup shall be aborted. > If a CM REJ MAD is sent for an SDP-specific error, the reject reason code > value shall be 28 (Consumer Reject -- 12.6.7.2 Rejection Reason on page 665. > An SDP implementation is expected to cleanup any resources associated with > an aborted connection. Yes, that was what I saw too. It was unclear to me what the ramifications of aborting the connection are. It didn't say it couldn't be retried. Also, doesn't it leave open what would be done with Rejection Reason 26 ? > 2. The implementation is still missing: does it belong as part of CM, > or should it be a higher level thing like CMA? Yes, this is after how to deal with any standards issues with the ULPs. > 3. Is there some solution for backward compatibility? > There does not seem to exist a way to figure out whether > sending REJ makes sense since the remote side will retry the connection > with a better MTU value, or not. If consumer reject is the only REJ reason and the format of the ARI gives no clue, I agree (that there is no basis on which the active peer has any idea what to do). > > So the only issue here is the inefficiency in terms of the back and > > forth of CM messages to get to the 1K MTU connection. How important is > > connection rate for SDP and SRP ? If not, can't we live with how things > > are ? > > SDP implements sockets, and thats a very wide field, so everything is important. 
> AFAIK, connection rate is very important for some socket applications, and does > not matter at all for others. It sounds like better CM handling is important for connection rate. -- Hal From mst at mellanox.co.il Tue Feb 21 13:56:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Feb 2006 23:56:12 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <1140558211.28051.8131.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> Message-ID: <20060221215612.GE21543@mellanox.co.il> Quoting r. Hal Rosenstock : > > > this is what Dror wrote: > > > "The SWG defined a generic mechanism which uses REJ to indicate that > > > the passive side does not accept a certain REQ fields, and allows the > > > passive side to indicate an alternative value. Indirection is also > > > supported through the same protocol. It also allows the active side, > > > following the REJ, to use an alternate value, other than the one > > > suggested by the passive side, i.e. passive side only has a veto > > > capability." > > > > I think problems that need resolution are: > > 1. Some of places in spec require the connection to be terminated after REJ. > > It does not seem to describe what Dror says anywhere. > > > > For example, see SDP spec: > > > > A4.5.1.2 ABORTING CONNECTION SETUP > > CA4-43: When a CM REJ MAD is received by either the connecting or accepting > > peer the connection setup shall be aborted. > > If a CM REJ MAD is sent for an SDP-specific error, the reject reason code > > value shall be 28 (Consumer Reject -- 12.6.7.2 Rejection Reason on page 665. 
> > An SDP implementation is expected to cleanup any resources associated with > > an aborted connection. > > Yes, that was what I saw too. It was unclear to me what the > ramifications of aborting the connection are. It didn't say it couldn't > be retried. Also, doesn't it leave open what would be done with > Rejection Reason 26 ? Its a bit of a stretch to interpret "abort" as "retry". Still, if we expect the peer to retry this must be in the spec, right? > > 2. The implementation is still missing: does it belong as part of CM, > > or should it be a higher level thing like CMA? > > Yes, this is after how to deal with any standards issues with the ULPs. > > > 3. Is there some solution for backward compatibility? > > There does not seem to exist a way to figure out whether > > sending REJ makes sense since the remote side will retry the connection > > with a better MTU value, or not. > > If consumer reject is the only REJ reason and the format of the ARI > gives no clue, I agree (that there is no basis on which the active peer > has any idea what to do). Right. > > > So the only issue here is the inefficiency in terms of the back and > > > forth of CM messages to get to the 1K MTU connection. How important is > > > connection rate for SDP and SRP ? If not, can't we live with how things > > > are ? > > > > SDP implements sockets, and thats a very wide field, so everything is important. > > AFAIK, connection rate is very important for some socket applications, and does > > not matter at all for others. > > It sounds like better CM handling is important for connection rate. Right. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Tue Feb 21 14:02:21 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 14:02:21 -0800 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221213610.GD21543@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> Message-ID: <43FB8DED.1010804@ichips.intel.com> Michael S. Tsirkin wrote: > A4.5.1.2 ABORTING CONNECTION SETUP > CA4-43: When a CM REJ MAD is received by either the connecting or accepting > peer the connection setup shall be aborted. > If a CM REJ MAD is sent for an SDP-specific error, the reject reason code > value shall be 28 (Consumer Reject -- 12.6.7.2 Rejection Reason on page 665. > An SDP implementation is expected to cleanup any resources associated with > an aborted connection. > > 2. The implementation is still missing: does it belong as part of CM, > or should it be a higher level thing like CMA? Can you clarify what implementation is missing? Both the CM and CMA pass REJ messages to the ULP. It is up to the ULP to determine what action to take, either by retrying the request or aborting the connection completely. Are you saying that SDP doesn't cleanup after receiving a REJ? 
- Sean From mshefty at ichips.intel.com Tue Feb 21 14:04:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 14:04:52 -0800 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221215612.GE21543@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> Message-ID: <43FB8E84.1000104@ichips.intel.com> Michael S. Tsirkin wrote: >>>>So the only issue here is the inefficiency in terms of the back and >>>>forth of CM messages to get to the 1K MTU connection. How important is >>>>connection rate for SDP and SRP ? If not, can't we live with how things >>>>are ? >>> >>>SDP implements sockets, and thats a very wide field, so everything is important. >>>AFAIK, connection rate is very important for some socket applications, and does >>>not matter at all for others. >> >>It sounds like better CM handling is important for connection rate. > > Right. I didn't follow what changes are necessary to improve the connection rate. Can you clarify? - Sean From mst at mellanox.co.il Tue Feb 21 14:21:43 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 22 Feb 2006 00:21:43 +0200 Subject: [openib-general] Re: change Mellanox SDP workaround to a module parameter In-Reply-To: <43FB8DED.1010804@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <43FB8DED.1010804@ichips.intel.com> Message-ID: <20060221222143.GF21543@mellanox.co.il> Quoting Sean Hefty : > Can you clarify what implementation is missing? Here's an executive summary of the thread so far (unfortunately at some point In-Reply-To got stripped from some messages, so the thread is broken on the reflector, but you can search for "Mellanox SDP workaround"). We are talking about a situation where the passive side wants to downgrade the connection MTU from the value provided by the active side. This is important for Tavor, which supports 2K MTU but performs much better in RC with 1K MTU. The only way to do this in the existing CM is by sending a REJ with code 26; this REJ packet includes the MTU that can be supported. The active side is supposed to check this and retry the connection with the MTU value returned. Note that the CMA will have to do this too, at least when the CMA manages the QP states itself. Of course, extra round trips and retries affect connection rate significantly. I wish we had an MTU field in the REP packet, but we don't. HTH, -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 21 14:18:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 17:18:00 -0500 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <43FB8E84.1000104@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> Message-ID: <1140560278.28051.8345.camel@hal.voltaire.com> On Tue, 2006-02-21 at 17:04, Sean Hefty wrote: > Michael S. Tsirkin wrote: > >>>>So the only issue here is the inefficiency in terms of the back and > >>>>forth of CM messages to get to the 1K MTU connection. How important is > >>>>connection rate for SDP and SRP ? If not, can't we live with how things > >>>>are ? > >>> > >>>SDP implements sockets, and thats a very wide field, so everything is important. > >>>AFAIK, connection rate is very important for some socket applications, and does > >>>not matter at all for others. > >> > >>It sounds like better CM handling is important for connection rate. > > > > Right. > > I didn't follow what changes are necessary to improve the connection rate. Can > you clarify? Changes in the CM protocol so that a 3 way handshake can occur so that for example a lower MTU connection can be accepted with one REQ/REP/RTU sequence without REQ/REJ then REQ/REP/RTU. -- Hal > > - Sean From mst at mellanox.co.il Tue Feb 21 14:23:41 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 22 Feb 2006 00:23:41 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <43FB8E84.1000104@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> Message-ID: <20060221222341.GG21543@mellanox.co.il> Quoting r. Sean Hefty : > I didn't follow what changes are necessary to improve the connection rate. > Can you clarify? We were talking about extending the CM REP packet to carry additional information. See my other message for detail. It doesnt look like anyone will be bringing this up on the MgtWG anytime soon, though. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Tue Feb 21 14:30:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 14:30:46 -0800 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <1140560278.28051.8345.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> Message-ID: <43FB9496.1050505@ichips.intel.com> Hal Rosenstock wrote: >>>>>>So the only issue here is the inefficiency in terms of the back and >>>>>>forth of CM messages to get to the 1K MTU connection. How important is >>>>>>connection rate for SDP and SRP ? If not, can't we live with how things >>>>>>are ? >>>>> >>>>>SDP implements sockets, and thats a very wide field, so everything is important. >>>>>AFAIK, connection rate is very important for some socket applications, and does >>>>>not matter at all for others. >>>> >>>>It sounds like better CM handling is important for connection rate. >>> >>>Right. >> >>I didn't follow what changes are necessary to improve the connection rate. Can >>you clarify? > > > Changes in the CM protocol so that a 3 way handshake can occur so that > for example a lower MTU connection can be accepted with one REQ/REP/RTU > sequence without REQ/REJ then REQ/REP/RTU. Okay - this is what I thought. Personally, I don't think that it makes sense to change the architecture to optimize for hardware that happens to perform poorly under a given condition. 
What would happen if the hardware simply reported that it only supported a smaller MTU? - Sean From mst at mellanox.co.il Tue Feb 21 14:35:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Feb 2006 00:35:17 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <43FB9496.1050505@ichips.intel.com> References: <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> <43FB9496.1050505@ichips.intel.com> Message-ID: <20060221223517.GH21543@mellanox.co.il> Quoting r. Sean Hefty : > >>I didn't follow what changes are necessary to improve the connection > >>rate. Can you clarify? > > > > > >Changes in the CM protocol so that a 3 way handshake can occur so that > >for example a lower MTU connection can be accepted with one REQ/REP/RTU > >sequence without REQ/REJ then REQ/REP/RTU. > > Okay - this is what I thought. Personally, I don't think that it makes > sense to change the architecture to optimize for hardware that happens to > perform poorly under a given condition. Okay, but then we need to implement the REJ/retry sequence. I think it affects at least CMA and maybe CM. > What would happen if the hardware simply reported that it only supported a > smaller MTU? To the SM? IPoIB won't work: it requires 2K MTU. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 21 14:32:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Feb 2006 17:32:09 -0500 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221222341.GG21543@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FEDFE1@mtlexch01.mtl.com> <1140189358.4333.44536.camel@hal.voltaire.com> <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <20060221222341.GG21543@mellanox.co.il> Message-ID: <1140560727.28051.8412.camel@hal.voltaire.com> On Tue, 2006-02-21 at 17:23, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > I didn't follow what changes are necessary to improve the connection rate. > > Can you clarify? > > We were talking about extending the CM REP packet to carry additional > information. See my other message for detail. It doesnt look like anyone > will be bringing this up on the MgtWG anytime soon, though. It's the SWG which spec'd the CM not the MgtWG. 
-- Hal From mshefty at ichips.intel.com Tue Feb 21 14:45:15 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 14:45:15 -0800 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221223517.GH21543@mellanox.co.il> References: <20060219131242.GE22037@mellanox.co.il> <1140547502.28051.7015.camel@hal.voltaire.com> <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> <43FB9496.1050505@ichips.intel.com> <20060221223517.GH21543@mellanox.co.il> Message-ID: <43FB97FB.5090109@ichips.intel.com> Michael S. Tsirkin wrote: > Okay, but then we need to implement the REJ/retry sequence. > I think it affects at least CMA and maybe CM. I believe that this sort of retry need to be initiated by the ULP. The CMA needs some additional work to permit re-using an rdma_cm_id after a connection has been rejected, so that the user or CMA can modify the MTU before trying to re-establish a connection. - Sean From rdreier at cisco.com Tue Feb 21 14:57:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Feb 2006 14:57:41 -0800 Subject: [openib-general] Re: ibv_create_srq doesn't update the SRQ init attributes In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C50DB@mtlexch01.mtl.com> (Dotan Barak's message of "Tue, 21 Feb 2006 08:19:51 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C50DB@mtlexch01.mtl.com> Message-ID: Dotan> In the user level the response structure doesn't allow the Dotan> return of the SRQ attributes from kernel level to user Dotan> level, so I think that the ABI should be changed to support Dotan> it. Yes, I think that's the right way to do it. - R. 
From mst at mellanox.co.il Tue Feb 21 15:11:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Feb 2006 01:11:46 +0200 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <43FB97FB.5090109@ichips.intel.com> References: <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> <43FB9496.1050505@ichips.intel.com> <20060221223517.GH21543@mellanox.co.il> <43FB97FB.5090109@ichips.intel.com> Message-ID: <20060221231146.GA23176@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter > > Michael S. Tsirkin wrote: > >Okay, but then we need to implement the REJ/retry sequence. > >I think it affects at least CMA and maybe CM. > > I believe that this sort of retry need to be initiated by the ULP. Okay, but we still need the CM to generate the appropriate REJ with code 26, right? > The CMA needs some additional work to permit re-using an rdma_cm_id after a > connection has been rejected, so that the user or CMA can modify the MTU > before trying to re-establish a connection. Given that the issue only affects some hardware, it would be nice to somehow hide this from ULPs, otherwise it seems unlikely that they will get it right. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sashak at voltaire.com Tue Feb 21 15:33:00 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Feb 2006 01:33:00 +0200 Subject: [openib-general] [PATCH] opensm: fixes in signal handling Message-ID: <20060221233300.GE9041@sashak.voltaire.com> Hello, Here are fixes for the broken signal handling. 
Finally, this should prevent some opensm crashes (actually deadlocks), for instance those caused by a command such as:

  while [ 1 ] ; do kill -HUP ; done

Yael, I hope this patch does not break Windows compilation (it masks the signal-related functions under __WIN__), but I cannot be absolutely sure - please look at this from the "under Windows" point of view. Thanks.

Sasha.

This fixes broken signal handling. In this patch:
- signal handling stuff is moved to main.c
- cl_sig_* is replaced by the more powerful POSIX calls (I hope this is not bad for Windows, because it is under !__WIN__ anyway)
- the signal handler does not call the resweeper or wakeup directly, but only updates the new osm_hup_flag (or osm_exit_flag on SIGINT or SIGTERM)
- signal delivery is masked in all threads except the first one, so only the expected thread will be interrupted (from sleep() or poll())
- the resweep thread is woken up from the main.c thread instead of directly from the handler
- poll was added to osm_console - this provides a timeout ability and works around getline()'s signal interruption problem.
diff --git a/osm/include/opensm/osm_console.h b/osm/include/opensm/osm_console.h index 5a4036f..c5cd22a 100644 --- a/osm/include/opensm/osm_console.h +++ b/osm/include/opensm/osm_console.h @@ -50,6 +50,7 @@ BEGIN_C_DECLS void osm_console(osm_opensm_t *p_osm); +void osm_console_prompt(void); END_C_DECLS diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 833c4c3..3235ad4 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -388,38 +388,12 @@ osm_opensm_wait_for_subnet_up( /****v* OpenSM/osm_exit_flag */ -extern volatile int osm_exit_flag; +extern volatile unsigned int osm_exit_flag; /* * DESCRIPTION * Set to one to cause all threads to leave *********/ -#ifndef __WIN__ -/****f* OpenSM: OpenSM/osm_reg_sig_handler -* NAME -* osm_reg_sig_handler -* -* DESCRIPTION -* Registers the common signal handler -* -* SYNOPSIS -*/ -void osm_reg_sig_handler( -IN osm_opensm_t* const p_osm); -/* -* PARAMETERS -* p_osm -* [in] Pointer to a OpenSM object to handle signals on. -* -* RETURN VALUES -* None -* -* NOTES -* -* SEE ALSO -*********/ -#endif /* __WIN__ */ - END_C_DECLS #endif /* _OSM_OPENSM_H_ */ diff --git a/osm/opensm/main.c b/osm/opensm/main.c index c5ba443..797b14c 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -77,14 +77,59 @@ instantiating more than one opensm object. 
*/ osm_opensm_t osm; -volatile int osm_exit_flag = 0; + +volatile unsigned int osm_exit_flag = 0; + +static volatile unsigned int osm_hup_flag = 0; #define GUID_ARRAY_SIZE 64 #define INVALID_GUID (0xFFFFFFFFFFFFFFFFULL) + +#ifdef __WIN__ +#define block_signals() +#define setup_signals() +#else + +static void mark_exit_flag(int signum) +{ + if(!osm_exit_flag) + printf("OpenSM: Got signal %d - exiting...\n", signum); + osm_exit_flag = 1; +} + +static void mark_hup_flag(int signum) +{ + osm_hup_flag = 1; +} + +static sigset_t saved_sigset; + +static void block_signals() +{ + sigset_t set; + sigfillset(&set); + sigprocmask(SIG_SETMASK, &set, &saved_sigset); +} + +static void setup_signals() +{ + struct sigaction act; + sigfillset(&act.sa_mask); + act.sa_handler = mark_exit_flag; + act.sa_flags = 0; +#ifndef OSM_VENDOR_INTF_OPENIB + sigaction(SIGINT, &act, NULL); +#endif + sigaction(SIGTERM, &act, NULL); + act.sa_handler = mark_hup_flag; + sigaction(SIGHUP, &act, NULL); + sigprocmask(SIG_SETMASK, &saved_sigset, NULL); +} +#endif /* __WIN__ */ + /********************************************************************** **********************************************************************/ -void show_usage(void); void show_usage(void) @@ -247,7 +292,6 @@ show_usage(void) /********************************************************************** **********************************************************************/ -void show_menu(void); void show_menu(void) @@ -764,6 +808,8 @@ main( if ( cache_options == TRUE ) osm_subn_write_conf_file( &opt ); + block_signals(); + status = osm_opensm_init( &osm, &opt ); if( status != IB_SUCCESS ) { @@ -794,9 +840,6 @@ main( goto Exit; } - /* this should handle ^C etc */ - osm_reg_sig_handler( &osm ); - status = osm_opensm_bind( &osm, guid ); if( status != IB_SUCCESS ) { @@ -817,6 +860,8 @@ main( } } + setup_signals(); + osm_opensm_sweep( &osm ); /* since osm_opensm_init get opt as RO we'll set the opt value with UI pfn here */ /* Now do the 
registration */ @@ -839,11 +884,23 @@ main( In the future, some sort of console interactivity could be implemented in this loop. */ - while( !osm_exit_flag ) + if (opt.console) { + printf("\nOpenSM Console\n\n"); + osm_console_prompt(); + } + while( !osm_exit_flag ) { if (opt.console) osm_console(&osm); else cl_thread_suspend( 10000 ); + + if (osm_hup_flag) { + osm_hup_flag = 0; + /* a HUP signal should only start a new heavy sweep */ + osm.subn.force_immediate_heavy_sweep = TRUE; + cl_event_signal(&osm.sm.signal); + } + } } #if 0 diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c index c470b49..43d9f87 100644 --- a/osm/opensm/osm_console.c +++ b/osm/opensm/osm_console.c @@ -39,6 +39,7 @@ #define _GNU_SOURCE /* for getline */ #include #include +#include #include #define OSM_COMMAND_LINE_LEN 120 @@ -186,15 +187,27 @@ static void parse_cmd_line(char *line, o } } +void osm_console_prompt(void) +{ + printf("%s", OSM_COMMAND_PROMPT); + fflush(stdout); +} + void osm_console(osm_opensm_t *p_osm) { + struct pollfd pollfd; char *p_line; size_t len; ssize_t n; - printf("\nOpenSM Console\n\n"); - while (1) { - printf("%s", OSM_COMMAND_PROMPT); + pollfd.fd = 0; + pollfd.events = POLLIN; + pollfd.revents = 0; + + if (poll(&pollfd, 1, 10000) <= 0) + return; + + if (pollfd.revents|POLLIN) { p_line = NULL; /* Get input line */ n = getline(&p_line, &len, stdin); @@ -206,6 +219,7 @@ void osm_console(osm_opensm_t *p_osm) printf("Input error\n"); fflush(stdin); } + osm_console_prompt(); } } diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 6ca6796..54d0ae3 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include @@ -149,52 +148,6 @@ osm_opensm_create_mcgroups( } /********************************************************************** - * SHUT DOWN IS CONTROLLED BY A GLOBAL EXIT FLAG - **********************************************************************/ 
-#ifndef __WIN__ -static osm_opensm_t *__p_osm_to_signal; - -void -__sig_handler( - int signum ) -{ - static int got_signal = 0; - - if( signum != SIGHUP ) - { - if( !got_signal ) - { - got_signal++; - printf( "OpenSM: Got signal %d - exiting...\n", signum ); - osm_exit_flag = 1; - } - } - else - { - /* a HUP signal should only start a new heavy sweep */ - __p_osm_to_signal->subn.force_immediate_heavy_sweep = TRUE; - osm_state_mgr_process( &__p_osm_to_signal->sm.state_mgr, - OSM_SIGNAL_SWEEP ); - } -} - -void -osm_reg_sig_handler( - IN osm_opensm_t * const p_osm ) -{ - __p_osm_to_signal = p_osm; -#ifndef OSM_VENDOR_INTF_OPENIB - cl_reg_sig_hdl( SIGINT, __sig_handler ); -#endif - cl_reg_sig_hdl( SIGTERM, __sig_handler ); - cl_reg_sig_hdl( SIGHUP, __sig_handler ); - osm_exit_flag = 0; - - return; -} -#endif /* __WIN__ */ - -/********************************************************************** **********************************************************************/ ib_api_status_t osm_opensm_init( diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c index 8ace290..e252861 100644 --- a/osm/opensm/osm_sm.c +++ b/osm/opensm/osm_sm.c @@ -87,10 +87,6 @@ __osm_sm_sweeper( if( p_sm->thread_state == OSM_THREAD_STATE_INIT ) { - osm_log( p_sm->p_log, OSM_LOG_DEBUG, - "__osm_sm_sweeper: " "Masking ^C Signals\n" ); - cl_sig_mask_sigint( ); - p_sm->thread_state = OSM_THREAD_STATE_RUN; } diff --git a/osm/opensm/osm_vl15intf.c b/osm/opensm/osm_vl15intf.c index ef18e54..4796a17 100644 --- a/osm/opensm/osm_vl15intf.c +++ b/osm/opensm/osm_vl15intf.c @@ -85,7 +85,6 @@ __osm_vl15_poller( if ( p_vl->thread_state == OSM_THREAD_STATE_NONE) { - cl_sig_mask_sigint( ); p_vl->thread_state = OSM_THREAD_STATE_RUN; } From ftillier at silverstorm.com Tue Feb 21 15:38:09 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 21 Feb 2006 15:38:09 -0800 Subject: [openib-general] [PATCH] opensm: fixes in signal handling In-Reply-To: <20060221233300.GE9041@sashak.voltaire.com> References: 
<20060221233300.GE9041@sashak.voltaire.com> Message-ID: <79ae2f320602211538s64e5dec1jc60a782c6efbd1bb@mail.gmail.com> On 2/21/06, Sasha Khapyorsky wrote: > Hello, > > There are fixes for broken signal handling stuff. Finally this should > prevent some opensm crashes (actually deadlocks), for instance caused > by such command: > > while [ 1 ] ; do kill -HUP ; done > > > Yael, I hope this patch does not broke windows compilations (this masks > signal related functions under __WIN__), but cannot be absolutely sure - > please look at this from "under windows" point of view. Thanks. Haven't we discussed eliminating the signal handling from OpenSM before, since it was put in place to work around access layer bugs? Is there any reason to keep signal handling in OpenSM? - Fab From mshefty at ichips.intel.com Tue Feb 21 15:47:49 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 21 Feb 2006 15:47:49 -0800 Subject: [openib-general] Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <20060221231146.GA23176@mellanox.co.il> References: <20060221210956.GC21543@mellanox.co.il> <1140556533.28051.7947.camel@hal.voltaire.com> <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> <43FB9496.1050505@ichips.intel.com> <20060221223517.GH21543@mellanox.co.il> <43FB97FB.5090109@ichips.intel.com> <20060221231146.GA23176@mellanox.co.il> Message-ID: <43FBA6A5.9060508@ichips.intel.com> Michael S. Tsirkin wrote: >>>Okay, but then we need to implement the REJ/retry sequence. >>>I think it affects at least CMA and maybe CM. >> >>I believe that this sort of retry need to be initiated by the ULP. > > Okay, but we still need the CM to generate the approriate REJ with code 26, > right? 
> >>The CMA needs some additional work to permit re-using an rdma_cm_id after a >>connection has been rejected, so that the user or CMA can modify the MTU >>before trying to re-establish a connection. > > Given that the issue only affects some hardware, it would be nice to somehow > hide this from ULPs, otherwise it seems unlikely that they will get it right. Are you wanting _all_ connections to this hardware to change the MTU? I can see how this would be useful. I was assuming that this was an SDP only related issue. I'm not sure where we want this sort of policy. I'm reluctant to mask this sort of connection change completely in either the IB CM or CMA. We may still be able to locate the implementation there, but there should be someway for the user to override the settings. Since this is a hardware specific problem, can the driver deal with this? All received MADs are given to the driver before being processed. Can the driver intercept the REQ, consume it, and issue a REJ? This might be able to deal with the problem on the receive side. The sender may also want to adjust the MTU based on local hardware, but I'm not sure where's the best place to handle this. - Sean From sashak at voltaire.com Tue Feb 21 15:59:59 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Feb 2006 01:59:59 +0200 Subject: [openib-general] [PATCH] opensm: fixes in signal handling In-Reply-To: <79ae2f320602211538s64e5dec1jc60a782c6efbd1bb@mail.gmail.com> References: <20060221233300.GE9041@sashak.voltaire.com> <79ae2f320602211538s64e5dec1jc60a782c6efbd1bb@mail.gmail.com> Message-ID: <20060221235959.GG9041@sashak.voltaire.com> On 15:38 Tue 21 Feb , Fabian Tillier wrote: > > Is there any reason to keep signal handling in OpenSM? Yes, SIGHUP is used to initiate resweep process, SIGTERM and SIGINT for graceful exit (some days we may want to run opensm as daemon, I guess). Sasha. 
From ftillier at silverstorm.com Tue Feb 21 16:00:33 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 21 Feb 2006 16:00:33 -0800 Subject: [openib-general] [PATCH] opensm: fixes in signal handling In-Reply-To: <20060221235959.GG9041@sashak.voltaire.com> References: <20060221233300.GE9041@sashak.voltaire.com> <79ae2f320602211538s64e5dec1jc60a782c6efbd1bb@mail.gmail.com> <20060221235959.GG9041@sashak.voltaire.com> Message-ID: <79ae2f320602211600m1c713185w5166a1d44d6a363f@mail.gmail.com> On 2/21/06, Sasha Khapyorsky wrote: > On 15:38 Tue 21 Feb , Fabian Tillier wrote: > > > > Is there any reason to keep signal handling in OpenSM? > > Yes, SIGHUP is used to initiate resweep process, SIGTERM and SIGINT for > graceful exit (some days we may want to run opensm as daemon, I guess). Cool, thanks for the clarification. - Fab From mst at mellanox.co.il Tue Feb 21 16:13:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Feb 2006 02:13:25 +0200 Subject: [openib-general] Re: Re: Re: Re: [PATCH] change Mellanox SDP workaround toa moduleparameter In-Reply-To: <43FBA6A5.9060508@ichips.intel.com> References: <20060221213610.GD21543@mellanox.co.il> <1140558211.28051.8131.camel@hal.voltaire.com> <20060221215612.GE21543@mellanox.co.il> <43FB8E84.1000104@ichips.intel.com> <1140560278.28051.8345.camel@hal.voltaire.com> <43FB9496.1050505@ichips.intel.com> <20060221223517.GH21543@mellanox.co.il> <43FB97FB.5090109@ichips.intel.com> <20060221231146.GA23176@mellanox.co.il> <43FBA6A5.9060508@ichips.intel.com> Message-ID: <20060222001325.GB23176@mellanox.co.il> Quoting r. Sean Hefty : > > >The CMA needs some additional work to permit re-using an rdma_cm_id after > > >a connection has been rejected, so that the user or CMA can modify the > > >MTU before trying to re-establish a connection. > > > >Given that the issue only affects some hardware, it would be nice to somehow > >hide this from ULPs, otherwise it seems unlikely that they will get it right. 
> > Are you wanting _all_ connections to this hardware to change the MTU? I > can see how this would be useful. I was assuming that this was an SDP only > related issue. Yes. > I'm not sure where we want this sort of policy. I'm reluctant to mask this > sort of connection change completely in either the IB CM or CMA. We may > still be able to locate the implementation there, but there should be > someway for the user to override the settings. OK, although I expect most ULPs won't use it. > Since this is a hardware specific problem, can the driver deal with this? > All received MADs are given to the driver before being processed. Can the > driver intercept the REQ, consume it, and issue a REJ? This might be able > to deal with the problem on the receive side. The sender may also want to > adjust the MTU based on local hardware, but I'm not sure where's the best > place to handle this. Very tricky. Just testing some bit in CM or CMA makes much more sense to me. Let's call it mtu_quirk. We still have the problem on the send side, so I don't think it's worth it. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From troy at scl.ameslab.gov Tue Feb 21 16:14:15 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 21 Feb 2006 18:14:15 -0600 Subject: [openib-general] debian package version check issues Message-ID: <20060222001415.GA1357@minbar.scl.ameslab.gov> There's a few bogons in the libmthca version checks.. opteron2:/usr/src/openib-src/userspace/libmthca# dpkg-buildpackage dpkg-buildpackage: source package is libmthca dpkg-buildpackage: source version is 1.0 dpkg-buildpackage: source changed by Roland Dreier dpkg-buildpackage: host architecture amd64 dpkg-checkbuilddeps: Unmet build dependencies: libibverbs-dev (>= 1.0-rc2) dpkg-buildpackage: Build dependencies/conflicts unsatisfied; aborting. dpkg-buildpackage: (Use -d flag to override.)
opteron2:/usr/src/openib-src/userspace/libmthca# dpkg -l libibverbs-dev Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad) ||/ Name Version Description +++-================-================-================================================ ii libibverbs-dev 1.0 Development files for the libibverbs library From troy at scl.ameslab.gov Tue Feb 21 16:17:05 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 21 Feb 2006 18:17:05 -0600 Subject: [openib-general] debian package version check issues In-Reply-To: <20060222001415.GA1357@minbar.scl.ameslab.gov> References: <20060222001415.GA1357@minbar.scl.ameslab.gov> Message-ID: <20060222001705.GB1357@minbar.scl.ameslab.gov> On Tue, Feb 21, 2006 at 06:14:15PM -0600, Troy Benjegerdes wrote: > There's a few bogons in the libmthca version checks.. > And some build problems too, apparently.. I just installed the libibverbs-dev package. cc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -Wall -O2 -MT src_mthca_la-memfree.lo -MD -MP -MF .deps/src_mthca_la-memfree.Tpo -c src/memfree.c -o src_mthca_la-memfree.o >/dev/null 2>&1 if /bin/sh ./libtool --tag=CC --mode=compile cc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -Wall -O2 -MT src_mthca_la-mthca.lo -MD -MP -MF ".deps/src_mthca_la-mthca.Tpo" -c -o src_mthca_la-mthca.lo `test -f 'src/mthca.c' || echo './'`src/mthca.c; \ then mv -f ".deps/src_mthca_la-mthca.Tpo" ".deps/src_mthca_la-mthca.Plo"; else rm -f ".deps/src_mthca_la-mthca.Tpo"; exit 1; fi cc -DHAVE_CONFIG_H -I. -I. -I. 
-g -Wall -D_GNU_SOURCE -g -Wall -O2 -MT src_mthca_la-mthca.lo -MD -MP -MF .deps/src_mthca_la-mthca.Tpo -c src/mthca.c -fPIC -DPIC -o .libs/src_mthca_la-mthca.o In file included from src/mthca.c:48: src/mthca-abi.h:70: error: field 'ibv_cmd' has incomplete type src/mthca.c:109: error: unknown field 'resize_cq' specified in initializer src/mthca.c:113: error: unknown field 'query_srq' specified in initializer src/mthca.c:113: warning: initialization from incompatible pointer type src/mthca.c:116: error: unknown field 'query_qp' specified in initializer src/mthca.c:116: warning: initialization from incompatible pointer type make[2]: *** [src_mthca_la-mthca.lo] Error 1 make[2]: Leaving directory `/usr/src/openib-src/userspace/libmthca' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/openib-src/userspace/libmthca' make: *** [debian/stamp-makefile-build] Error 2 opteron2:/usr/src/openib-src/userspace/libmthca# svnversion . 5457 From troy at scl.ameslab.gov Tue Feb 21 16:23:21 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Tue, 21 Feb 2006 18:23:21 -0600 Subject: [openib-general] debian package version check issues In-Reply-To: <20060222001415.GA1357@minbar.scl.ameslab.gov> References: <20060222001415.GA1357@minbar.scl.ameslab.gov> Message-ID: <20060222002321.GC1357@minbar.scl.ameslab.gov> Note to self: Make sure all old '/usr/local/include/infiniband' stuff is nuked when you install the debian packages. (Btw, is that an error if configure picks up /usr/local/include/infiniband before /usr/include/infiniband?) On Tue, Feb 21, 2006 at 06:14:15PM -0600, Troy Benjegerdes wrote: > There's a few bogons in the libmthca version checks.. 
> > opteron2:/usr/src/openib-src/userspace/libmthca# dpkg-buildpackage > dpkg-buildpackage: source package is libmthca > dpkg-buildpackage: source version is 1.0 > dpkg-buildpackage: source changed by Roland Dreier > dpkg-buildpackage: host architecture amd64 > dpkg-checkbuilddeps: Unmet build dependencies: libibverbs-dev (>= > 1.0-rc2) > dpkg-buildpackage: Build dependencies/conflicts unsatisfied; aborting. > dpkg-buildpackage: (Use -d flag to override.) From lindahl at pathscale.com Tue Feb 21 17:17:45 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 21 Feb 2006 17:17:45 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <1140542552.28051.6480.camel@hal.voltaire.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> Message-ID: <20060222011745.GL2853@greglaptop.internal.keyresearch.com> Is this a correct summary of this thread? * IPoIB uses an InfiniBand multicast group to fake ethernet broadcast * This is optional, I'm not sure what functionality is lost without it * MVAPICH uses a multicast group for some MPI collectives * This can be turned off by setting env var DISABLE_HARDWARE_MCST * An IB multicast group has to use ports of the same speed * This one was a surprise to me Ergo, when you mix 1X, 4X SDR, and 4X DDR hosts, it behaves differently from a homogeneous network. 
-- greg From rdreier at cisco.com Tue Feb 21 17:21:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Feb 2006 17:21:33 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <20060222011745.GL2853@greglaptop.internal.keyresearch.com> (Greg Lindahl's message of "Tue, 21 Feb 2006 17:17:45 -0800") References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> Message-ID: Greg> Is this a correct summary of this thread? * IPoIB uses an Greg> InfiniBand multicast group to fake ethernet broadcast * This Greg> is optional, I'm not sure what functionality is lost without Greg> it No, IPoIB uses multicast groups to implement IP broadcast and multicast. Without multicast, IPoIB can't do ARP (or IP multicast, etc). So it's an absolute requirement. Greg> * MVAPICH uses a multicast group for some MPI collectives * Greg> This can be turned off by setting env var Greg> DISABLE_HARDWARE_MCST Probably right, don't know for sure. Greg> * An IB multicast group has to use ports of the same speed * Greg> This one was a surprise to me No, but an IB multicast group has a speed associated to it. This is to allow, say, a 4X port sending multicast packets to use the right static rate to avoid overrunning a 1X port that is also a member of the group. - R. 
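The static-rate arithmetic mentioned above can be illustrated with a small sketch. It assumes the throttle is expressed as an inter-packet delay (IPD) multiplier, where a port idles for IPD packet-times after each packet it sends, so the effective injection rate is link_rate / (IPD + 1). The helper name ipd_for_rate and the rate units (multiples of 2.5 Gb/s) are hypothetical and chosen for illustration only; real hardware and drivers may encode the static rate differently.

```c
/* Rates in multiples of 2.5 Gb/s: 1X SDR = 1, 4X SDR = 4, 4X DDR = 8. */
enum { RATE_1X_SDR = 1, RATE_4X_SDR = 4, RATE_4X_DDR = 8 };

/*
 * Hypothetical helper: smallest inter-packet delay multiplier that keeps a
 * sender at or below target_rate, assuming
 *     effective rate = link_rate / (ipd + 1).
 */
static unsigned int ipd_for_rate(unsigned int link_rate, unsigned int target_rate)
{
    if (target_rate == 0 || link_rate <= target_rate)
        return 0;   /* no throttling needed (or no valid target given) */

    /* ceil(link_rate / target_rate) - 1 */
    return (link_rate + target_rate - 1) / target_rate - 1;
}
```

Under these assumptions, a 4X SDR port sending to a 1X group would idle three packet-times per packet sent, and a 4X DDR port seven, while a 1X port needs no delay at all.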
From bos at pathscale.com Tue Feb 21 17:44:39 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 21 Feb 2006 17:44:39 -0800 Subject: [openib-general] Towards a 1.0 release of OpenIB Message-ID: <1140572679.6603.28.camel@camp4.serpentine.com> Here's a strawman proposal for a 1.0 release process. Please let me know what you think. I have a set of absolutely minimal goals for the 1.0 release, and I would like to open up a short period of wider discussion about those goals. Expectation management: * The process is open and transparent. Discussion happens on openib-general. Bugs go into Bugzilla. Documentation lives in the wiki. Changes are made in Subversion. There should be no way someone can step up after the fact and say "but I wasn't informed of the plan!" * The target user population is reasonably savvy early adopters. * For everything that we commit to shipping, we must be able to tell users what has been tested, how heavily, and on what hardware. Testing: * We need to know what tests people can run, and in what environments. * We would like everyone to be able to run the same tests, so someone must gather test suites and execution instructions together. Methods of delivery: * A branch of the Subversion repository. * A set of source tarballs. * A collection of binary packages. We need to identify distros that people are interested in, and distros that people have time and resources to build for. Milestone timeline: * Feb 24 - create 1.0 release branch in Subversion repository * Feb 28 - close of "what I want in the 1.0 release" discussion * Feb 28 - Bugzilla configured properly * Mar 03 - wiki contains actual data about test suites, who's running what, status, etc. 
* Mar 06 - rc1 snapshot and source tarballs available * Mar 27 - rc2 * Apr 17 - rc3 * May 08 - 1.0 Within the next week, I'd like to gain an understanding of the following things: * Which features users want to see tested * Who can sign up to test and maintain those features, and how * Which distros users want binary packages for * Who can sign up to build and test those packages * Whether we need to be building binary kernel packages to make testing more consistent * Which patches or features need to be pushed to the upstream kernel (I'd prefer an unpatched kernel.org 2.6.17 to just work with the 1.0 userspace, for example) From lindahl at pathscale.com Tue Feb 21 18:28:17 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 21 Feb 2006 18:28:17 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: References: <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> Message-ID: <20060222022817.GA5391@greglaptop.internal.keyresearch.com> On Tue, Feb 21, 2006 at 05:21:33PM -0800, Roland Dreier wrote: > No, but an IB multicast group has a speed associated to it. This is > to allow, say, a 4X port sending multicast packets to use the right > static rate to avoid overrunning a 1X port that is also a member of > the group. Thanks for the clarification, Roland, I see that the previous discussion had successfully confused me. So is it the case that the create and the joins to a multicast group have to specify the correct speed? And the problem then would be that a IB host at boot doesn't know the right speed? Am I getting colder or warmer? 
-- greg From ftillier at silverstorm.com Tue Feb 21 20:17:02 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 21 Feb 2006 20:17:02 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <20060222022817.GA5391@greglaptop.internal.keyresearch.com> References: <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> Message-ID: <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> On 2/21/06, Greg Lindahl wrote: > On Tue, Feb 21, 2006 at 05:21:33PM -0800, Roland Dreier wrote: > > > No, but an IB multicast group has a speed associated to it. This is > > to allow, say, a 4X port sending multicast packets to use the right > > static rate to avoid overrunning a 1X port that is also a member of > > the group. > > Thanks for the clarification, Roland, I see that the previous > discussion had successfully confused me. So is it the case that the > create and the joins to a multicast group have to specify the correct > speed? The node joining or creating the multicast group doesn't need to specify the rate - the SA can figure out the rate to use based on the requestor (for creation), or validate that the requestor supports the existing group's rate (for joining). The problem is that OpenSM does not enforce this, allowing a rate mismatch between members of a multicast group. > And the problem then would be that a IB host at boot doesn't > know the right speed? 
A host can always figure out its rate by looking at the local port attributes - the link width active and link speed active would give it all the information necessary. > Am I getting colder or warmer? Colder. The problem, assuming OpenSM will be fixed to check rates, is for IPoIB to detect that the join failed due to a rate mismatch so that it can log an intelligent message to the system log. Currently, for Linux it will retry forever, albeit at a decaying interval. The algorithm in Windows is a little different, as the driver has been designed to function properly with SMs that don't pre-create the broadcast group, and in this case, the potential races in creation and joining the group coupled with the lack of detailed error status from the SA upon failure make it impossible to differentiate between a true error or just a timing race. Hope that helps! - Fab From lindahl at pathscale.com Tue Feb 21 21:52:54 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 21 Feb 2006 21:52:54 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> References: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1140532783.28051.5457.camel@hal.voltaire.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> Message-ID: <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> On Tue, Feb 21, 2006 at 08:17:02PM -0800, Fabian Tillier wrote: > The node joining or creating the multicast group doesn't need to > specify the rate - the SA can figure out the rate to use based on the > requestor (for creation), or validate that the requestor 
supports the > existing group's rate (for joining). Um, but that gets back to my point: I want 1X, 4X SDR, and 4X DDR nodes running IPoIB to share a multicast group. Are you saying this can be done by making the group a 1X group? Or that it's impossible to have such a group? Or that everyone would have to drop to 1X to make such a group? -- greg From ftillier at silverstorm.com Tue Feb 21 23:40:53 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 21 Feb 2006 23:40:53 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> References: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> Message-ID: <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> On 2/21/06, Greg Lindahl wrote: > On Tue, Feb 21, 2006 at 08:17:02PM -0800, Fabian Tillier wrote: > > > The node joining or creating the multicast group doesn't need to > > specify the rate - the SA can figure out the rate to use based on the > > requestor (for creation), or validate that the requestor supports the > > existing group's rate (for joining). > > Um, but that gets back to my point: I want 1X, 4X SDR, and 4X DDR > nodes running IPoIB to share a multicast group. Are you saying this > can be done by making the group a 1X group? Or that it's impossible > to have such a group? Or that everyone would have to drop to 1X to > make such a group? You'd have to make the group 1X. 
Note that the group being 1X doesn't limit unicast traffic to 1X rates, since the rate for unicast traffic would be set based on the rate reported in the path records for the various endpoints. So 4X SDR and 4X DDR nodes would have to set their inter-packet delay for the broadcast group to end up with a 1X packet injection rate. - Fab From jackm at mellanox.co.il Tue Feb 21 23:37:56 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 22 Feb 2006 09:37:56 +0200 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 In-Reply-To: <43FB5571.1060000@ichips.intel.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4BCC@mtlexch01.mtl.com> <43FB5571.1060000@ichips.intel.com> Message-ID: <200602220937.56443.jackm@mellanox.co.il> On Tuesday 21 February 2006 20:01, Sean Hefty wrote: > > I did it this way to avoid using kzalloc for all segment allocations, or > > kmalloc/memset for all. This way, only the last segment is "memset"'ed > > and only in the padding area. > > The API states that the buffer returned from ib_create_send_mad will be > cleared. We either need to clear the buffer, or update the documentation. > Either way is OK by me. Note that ib_create_send_mad does allocate the "base" MAD with kzalloc. It's just the RMPP segments that are allocated with kmalloc and are not initially cleared. > > [JPM]: This was done to hide segment implementation details from the > > user_mad layer. You are correct that we can get rid of the size field, > > and store it only once. > > However, I think that the proper place for this value is the > > "ib_mad_send_wr_private" structure, which also holds the list pointer. > > How about if we change > > struct ib_mad_multipacket_seg > > *ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int > > seg_num) > > > > to also return the segment size: > > Since the user needs to know the segment size, it makes more sense to me to > just expose it through ib_mad_send_buf.
> I just thought that by returning the segment size in the procedure call, we preserve the option to easily support multiple segment sizes within a single RMPP message. However, if you think that we will never need such an option, I don't object to exposing the segment size in the ib_mad_send_buf structure. > > [JPM]: You are correct. In fact, the test just before this one is also > > performed in ib_create_send_mad(), and may also be deleted (this was not > > part of the patch): > > > > if (rmpp_active && !agent->rmpp_version) { > > ret = -EINVAL; > > goto err_ah; > > } > > I will remove these, and submit a patch for any other changes. Thanks. > > - Sean From lindahl at pathscale.com Wed Feb 22 00:26:35 2006 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed, 22 Feb 2006 00:26:35 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> References: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> Message-ID: <20060222082635.GE1902@greglaptop.hsd1.ca.comcast.net> On Tue, Feb 21, 2006 at 11:40:53PM -0800, Fabian Tillier wrote: > You'd have to make the group 1X. Note that the group being 1X doesn't > limit unicast traffic to 1X rates, since the rate for unicast traffic > would be set based on the rate reported in the path records for the > various endpoints. > > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > for the broadcast group to end up with a 1X packet injection rate. 
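[Editor's sketch, not part of the thread.] The inter-packet delay mentioned in the quoted text above can be worked through with a little arithmetic. The snippet assumes the static-rate model in which a port stays idle for IPD packet-times after each packet, so the effective injection rate is the link rate divided by (IPD + 1). The helper name ipd_for_rate and the use of signalling rates in Mb/s (1X SDR = 2500) are illustrative assumptions, not taken from any OpenIB source.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from any OpenIB source.
 * Assumption: after each packet the port idles for IPD packet-times,
 * so the effective injection rate is link_rate / (IPD + 1).
 * Rates are signalling rates in Mb/s (1X SDR = 2500, 4X SDR = 10000,
 * 4X DDR = 20000).
 */
static unsigned int ipd_for_rate(unsigned int link_rate,
				 unsigned int group_rate)
{
	assert(group_rate > 0);

	/* A link no faster than the group needs no extra delay. */
	if (link_rate <= group_rate)
		return 0;

	/* Smallest IPD such that link_rate / (IPD + 1) <= group_rate. */
	return (link_rate + group_rate - 1) / group_rate - 1;
}
```

Under these assumptions, a 4X SDR port sending to a 1X group would use an IPD of 3, and a 4X DDR port an IPD of 7, while unicast traffic between the faster ports is unaffected.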
So, basically, MVAPICH doesn't have code that does either the group creation properly when there is a mixture of HCA bandwidths, or limit the packet injection rate. And IPoIB could violate this rule depending on how user programs use it, e.g. if I did a lot of broadcasting, I could easily exceed 1X's bandwidth. So this is more than just a "fix OpenSM" issue. It's more of a "fix the spec" issue, if I'm understanding it correctly. -- greg From mst at mellanox.co.il Wed Feb 22 02:40:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Feb 2006 12:40:38 +0200 Subject: [openib-general] ipoib_multicast_ah.patch Message-ID: <20060222104037.GB21077@mellanox.co.il> Hi, Roland! The following issue (ipoib_multicast_ah.patch) was found by code review. I added a bit more explanations. What do you think now? --- ipoib_mcast_send tests mcast->ah twice. If this value is changed between these two points, we leak an skb. As a solution, take priv->lock around assignment to mcast->ah thus making sure ipoib_mcast_send is not in flight. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin
Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-01-11 14:38:18.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2006-01-11 14:38:23.000000000 +0200
@@ -210,6 +210,8 @@ static int ipoib_mcast_join_finish(struc
 {
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_ah *ah;
+	unsigned long flags;
 	int ret;
 
 	mcast->mcmember = *mcmember;
@@ -266,8 +268,8 @@ static int ipoib_mcast_join_finish(struc
 			av.static_rate, priv->local_rate,
 			ib_sa_rate_enum_to_int(mcast->mcmember.rate));
 
-		mcast->ah = ipoib_create_ah(dev, priv->pd, &av);
-		if (!mcast->ah) {
+		ah = ipoib_create_ah(dev, priv->pd, &av);
+		if (!ah) {
 			ipoib_warn(priv, "ib_address_create failed\n");
 		} else {
 			ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT
@@ -277,6 +279,9 @@ static int ipoib_mcast_join_finish(struc
 				be16_to_cpu(mcast->mcmember.mlid),
 				mcast->mcmember.sl);
 		}
+		spin_lock_irqsave(&priv->lock, flags);
+		mcast->ah = ah;
+		spin_unlock_irqrestore(&priv->lock, flags);
 	}
 
 	/* actually send any queued packets */
-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
From gil at mellanox.co.il Wed Feb 22 03:08:49 2006 From: gil at mellanox.co.il (Gil Bloch) Date: Wed, 22 Feb 2006 13:08:49 +0200 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E301075D8C@mtlexch01.mtl.com> Roland, I believe we should add support for a resize WQ command (as a part of modify QP) to enable changing the WQ size. On a very large scale cluster, with many operating QPs, the work queue memory consumption might be expensive. Thus the MPI implementation should trade off pipelining requests against WQ memory consumption.
The resize WQ will allow on-demand adaptive WQ setting instead of static allocation of the memory resource, which I believe can increase performance and save memory at the same time. Regards, Gil Bloch > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, February 16, 2006 9:18 PM > To: openib-general at openib.org > Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond > > I thought it might be helpful to give an informal roadmap of my plans > for libibverbs. As I said, I hope to have a libibverbs 1.0 release > out in three weeks or so. Once that happens, I plan to do 1.0.x > maintenance releases "as needed." > > At the same time, I have some ideas for libibverbs 1.1. I think that > I can get my ideas done in six months or so, which seems like a good > interval between major releases. My ideas (also listed in the > libibverbs README) are the following; I'd like to hear what other > people's plans for libibverbs work are so that we can figure out if > those projects fit into a 1.1 release, and when we expect to land the > new features. > > * Implement memory window (MW) support. This will break the > device driver ABI, because new methods will need to be added to > struct ibv_context_ops. > > * Implement the reregister memory region (MR) verb. We will add an > extension to the IB spec to allow the application to indicate that > the region is only being extended, and that operations in progress > should _not_ fail (contrary to the IB spec, which states that > reregister must be implemented so that it behaves equivalently to a > deregister followed by a register). This will break the device > driver ABI, because a new method will need to be added to struct > ibv_context_ops. > > * Eliminate the dependency on libsysfs by implementing the required > sysfs handling directly. 
This will break the API, because the dev > and ibdev members of struct ibv_device will be removed. It will > also break the device driver ABI, because the signature of the > driver initialization function will change. The driver > initialization function will be changed as part of this work; this > has the added benefit of allowing us to choose a better name than > "openib_driver_init." > > I'm also thinking of moving my libibverbs and libmthca development > trees to git (most likely hosted at kernel.org). This has the > drawback of moving their development repositories out of the common > openib.org svn tree. However, it will make handling 1.0, 1.1 and > feature development branches much easier. I'd like to hear opinions > on this before I make a decision. > > - R. From halr at voltaire.com Wed Feb 22 03:32:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Feb 2006 06:32:29 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> References: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <1140539024.28051.6108.camel@hal.voltaire.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> Message-ID: <1140607947.28051.13869.camel@hal.voltaire.com> On Wed, 2006-02-22 at 02:40, Fabian Tillier wrote: > On 
2/21/06, Greg Lindahl wrote: > > On Tue, Feb 21, 2006 at 08:17:02PM -0800, Fabian Tillier wrote: > > > > > The node joining or creating the multicast group doesn't need to > > > specify the rate - the SA can figure out the rate to use based on the > > > requestor (for creation), or validate that the requestor supports the > > > existing group's rate (for joining). > > > > Um, but that gets back to my point: I want 1X, 4X SDR, and 4X DDR > > nodes running IPoIB to share a multicast group. Are you saying this > > can be done by making the group a 1X group? Or that it's impossible > > to have such a group? Or that everyone would have to drop to 1X to > > make such a group? > > You'd have to make the group 1X. Note that the group being 1X doesn't > limit unicast traffic to 1X rates, since the rate for unicast traffic > would be set based on the rate reported in the path records for the > various endpoints. It does, however, limit all other (IB) multicast groups in that partition to the same rate as the IPoIB broadcast group. That may be the correct choice of the admin (and 1x nodes would be refused). -- Hal > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > for the broadcast group to end up with a 1X packet injection rate. > > - Fab From takshak at gs-lab.com Wed Feb 22 03:56:10 2006 From: takshak at gs-lab.com (Takshak C.) Date: Wed, 22 Feb 2006 17:26:10 +0530 Subject: [openib-general] Get Table Records for SA Attribute ID ?
In-Reply-To: <1140120738.4333.33149.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> Message-ID: <43FC515A.3020404@gs-lab.com> Hal Rosenstock wrote: >>Please throw some light on this. Do you have any userspace SA support for retrieving path, service record >>information ? >> >> > >There have been discussions about userspace SA support but nothing >currently for OpenIB (gen2). Currently, you can get this by using > > Could you please tell me, when userspace SA support will be available in openIB gen2. >osm_vendor_ibumad_sa.c which supports most SA requests. It is built as >part of libosmvendor (part of the OpenSM build) but can be used outside >of OpenSM. It is used by osmtest if you want to look at some use cases. >It obtains PathRecords and ServiceRecords. That might be an easier >direction to go than trying to use the management libraries to build the >pieces of a userspace SA client you want. > >-- Hal > > See, to execute osmtest, I found that openSM instance must be there. So, even if I use part of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I have to start openSM instance to execute the SA query successfully. Without starting openSM client, I m able to retrieve node description, node info, SM info, port info by using management libraries libibumad and libibmad. What I want to achieve is, without talking with openSM instance, my SA query client should go and get the required information. Is this possible ?. Would like to know your inputs on this. Regards, - Takshak >>Regards. >>- Takshak >> >> >>Hal Rosenstock wrote: >> >> >> >>>Hi, >>> >>>There are a couple of issues with the below. >>> >>>1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. >>> >>>2. I will assume your register call sets RMPP. >>> >>>3. SA class version is 2. >>> >>>What SM are you using ? 
If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). >>> >>>-- Hal >>> >>>________________________________ >>> >>>From: openib-general-bounces at openib.org on behalf of Takshak C. >>>Sent: Mon 2/6/2006 8:00 AM >>>To: openib-general at openib.org >>>Subject: [openib-general] Get Table Records for SA Attribute ID ? >>> >>> >>> >>>Hi, >>> >>>I m trying to get the table records for SA attribute ID in following way. >>>But, I m not getting a single record, could anyone comment on the problem. >>> >>>1. I have created saMadFormat structure described in the specification as below: >>> >>>struct saMadFormat >>>{ >>> >>> uint8_t base_version ; >>> uint8_t mgmt_class ; >>> uint8_t class_version ; >>> uint8_t sa_method ; >>> uint16_t status ; >>> uint16_t not_used ; >>> uint64_t tid ; >>> uint16_t attr_id ; >>> uint16_t resv ; >>> uint32_t attr_mod ; >>> uint64_t sa_key; >>> uint64_t sm_key ; >>> uint32_t seg_num ; >>> uint32_t payload_len ; >>> uint8_t frag_flag ; >>> uint8_t edit_mod ; >>> uint16_t window ; >>> uint32_t endRID ; >>> uint64_t comp_mask ; >>> uint8_t adminData[192] ; >>>}; >>> >>>2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS >>> and umad_open_port etc successfully. >>> >>>3. struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); >>> memset(saQuery, 0, sizeof(*saQuery)); >>> >>> saQuery->base_version = 1; >>> saQuery->mgmt_class = IB_SA_CLASS ; >>> saQuery->class_version = 1 ; >>> saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; >>> saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; >>> saQuery->attr_mod = 0 ; >>> saQuery->tid = htonll(drmad_tid++); >>> saQuery->endRID = 0 ; >>> >>> umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); >>> umad_set_grh(umad, 0); >>> umad_set_pkey(umad, 0xFFFF); >>> >>>4. 
length = IB_MAD_SIZE; >>> >>> if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) >>> IBPANIC("send failed"); >>> >>> if (umad_recv(portid, umad, &length, -1) != mad_agent) >>> IBPANIC("recv error: %s", drmad_status_str(saQuery)); >>> >>> if (!dump_char) { >>> xdump(stdout, 0, saQuery->adminData, 192); >>> return 0; >>> } >>> >>>I m expecting that, I will get the resultant data in saQuery->adminData. >>>Is this correct ? If not then, how should I retrieve the table records ? >>>Any Idea ? >>> >>>Thanks >>>- Takshak From halr at voltaire.com Wed Feb 22 04:40:37 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Feb 2006 07:40:37 -0500 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <43FC515A.3020404@gs-lab.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> Message-ID: <1140612035.28051.14448.camel@hal.voltaire.com> On Wed, 2006-02-22 at 06:56, Takshak C. wrote: > Hal Rosenstock wrote: > > > Please throw some light on this. Do you have any userspace SA support for retrieving path, service record > > > information ? > > > > > > > There have been discussions about userspace SA support but nothing > > currently for OpenIB (gen2). Currently, you can get this by using > > > Could you please tell me, when userspace SA support will be available > in openIB gen2. I don't know but I'm not sure how much this helps you based on your questions below. > > osm_vendor_ibumad_sa.c which supports most SA requests.
It is built as > > part of libosmvendor (part of the OpenSM build) but can be used outside > > of OpenSM. It is used by osmtest if you want to look at some use cases. > > It obtains PathRecords and ServiceRecords. That might be an easier > > direction to go than trying to use the management libraries to build the > > pieces of a userspace SA client you want. > > > > -- Hal > > > See, to execute osmtest, I found that openSM instance must be there. Must be where ? What is your IB configuration ? > So, even if I use part > of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I have to > start openSM > instance to execute the SA query successfully. An SM is needed in the subnet and SA is part of that and answers such queries. > Without starting openSM client, I m able to retrieve node description, > node info, SM info, > port info by using management libraries libibumad and libibmad. of the local node only (until the SM brings up the subnet). > What I want to achieve is, without talking with openSM instance, my SA > query client > should go and get the required information. Why ? > Is this possible ?. No. What would you query for paths to if the subnet were not up ? -- Hal > Would like to know your inputs on this. > > Regards, > - Takshak > > > Regards. > > > - Takshak > > > > > > > > > Hal Rosenstock wrote: > > > > > > > > > > Hi, > > > > > > > > There are a couple of issues with the below. > > > > > > > > 1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. > > > > > > > > 2. I will assume your register call sets RMPP. > > > > > > > > 3. SA class version is 2. > > > > > > > > What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). 
From takshak at gs-lab.com Wed Feb 22 05:09:28 2006 From: takshak at gs-lab.com (Takshak C.) Date: Wed, 22 Feb 2006 18:39:28 +0530 Subject: [openib-general] Get Table Records for SA Attribute ID ?
In-Reply-To: <1140612035.28051.14448.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> Message-ID: <43FC6288.4040402@gs-lab.com> Thanks a lot Hal, for clearing my doubts. I would like to redefine my problem based on your inputs. I am into a scenario, where vendor specific primary SM is running in the subnet. This running SM is different than openSM. I have loaded an openIB stack on the host. Some of the sample examples from management/diags/src/ directory like smpquery for nodeinfo etc works and gives result to me. Now, could it be possible for me to write a SA query and fetch the path, service or info records without starting openSM instance as I have already primary SM running in the subnet. ? I believe, this question could be right and your answer would help me. I do not want to start openSM because then synchronization between primary SM and openSM would bring other issues or difficulties. Could you please tell me, how should I go about it ? Waiting. Regards. - Takshak From halr at voltaire.com Wed Feb 22 05:57:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Feb 2006 08:57:11 -0500 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <43FC6288.4040402@gs-lab.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> <43FC6288.4040402@gs-lab.com> Message-ID: <1140615978.28051.15030.camel@hal.voltaire.com> On Wed, 2006-02-22 at 08:09, Takshak C. wrote: > Thanks a lot Hal, for clearing my doubts. > I would like to redefine my problem based on your inputs. > > I am into a scenario, where vendor specific primary SM is running in > the subnet. > This running SM is different than openSM. I have loaded an openIB > stack on the host. OK. I understand your configuration.
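[Editor's sketch, not part of the thread.] Hal's earlier review points in this thread (the SA MAD structure needs an RMPP header, and the SA class version is 2) can be illustrated with a struct layout. This is only a sketch, assuming the 256-byte MAD format with a 24-byte common header, a 12-byte RMPP header, a 20-byte SA header and 200 bytes of attribute data. All type, field and function names here are made up for illustration and are not the libibumad or kernel definitions. The packed attribute is a GCC/Clang extension, and multi-byte fields would need to be big-endian on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout only; names are hypothetical. */
struct sa_mad_sketch {
	/* Common MAD header, 24 bytes. */
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t status;
	uint16_t class_specific;
	uint64_t tid;
	uint16_t attr_id;
	uint16_t resv;
	uint32_t attr_mod;
	/* RMPP header, 12 bytes (absent from the struct in the question). */
	uint8_t  rmpp_version;
	uint8_t  rmpp_type;
	uint8_t  rmpp_rtime_flags;
	uint8_t  rmpp_status;
	uint32_t seg_num;
	uint32_t paylen_newwin;
	/* SA header, 20 bytes. */
	uint64_t sm_key;
	uint16_t attr_offset;
	uint16_t reserved;
	uint64_t comp_mask;
	/* Attribute data, 200 bytes. */
	uint8_t  data[200];
} __attribute__((packed));

/* Zero the MAD and fill in the version fields discussed in the thread. */
static void sa_mad_sketch_init(struct sa_mad_sketch *mad)
{
	assert(mad != NULL);
	memset(mad, 0, sizeof(*mad));
	mad->base_version = 1;
	mad->mgmt_class = 0x03;		/* subnet administration class */
	mad->class_version = 2;		/* SA class version is 2, not 1 */
}
```

The sizes add up to 24 + 12 + 20 + 200 = 256 bytes, which is why the attribute data area is 200 bytes here rather than the 192 used in the original question.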
> Some of the sample examples from management/diags/src/ directory like > smpquery > for nodeinfo etc works and gives result to me. > > Now, could it be possible for me to write a SA query and fetch the > path, service > or info records Info records ? > without starting openSM instance as I have already primary SM > running in the subnet. ? Yes; all you are (conceptually) talking about is a user SA client. > I believe, this question could be right and your answer would > help me. > > I do not want to start openSM because then synchronization between > primary SM > and openSM would bring other issues or difficulties. Understood. It was unclear whether you had an SM in your subnet. You should be able to link libopensm and the other management libraries to an SA application which would do this (and not require OpenSM itself). > Could you please tell me, how should I go about it ? Waiting. I think I've already answered this. -- Hal > Regards. > - Takshak > > > > Hal Rosenstock wrote: > > On Wed, 2006-02-22 at 06:56, Takshak C. wrote: > > > > > Hal Rosenstock wrote: > > > > > > > > Please throw some light on this. Do you have any userspace SA support for retrieving path, service record > > > > > information ? > > > > > > > > > > > > > > > > > > There have been discussions about userspace SA support but nothing > > > > currently for OpenIB (gen2). Currently, you can get this by using > > > > > > > > > > > > > > Could you please tell me, when userspace SA support will be available > > > in openIB gen2. > > > > > > > I don't know but I'm not sure how much this helps you based on your > > questions below. > > > > > > > > osm_vendor_ibumad_sa.c which supports most SA requests. It is built as > > > > part of libosmvendor (part of the OpenSM build) but can be used outside > > > > of OpenSM. It is used by osmtest if you want to look at some use cases. > > > > It obtains PathRecords and ServiceRecords. 
That might be an easier > > > > direction to go than trying to use the management libraries to build the > > > > pieces of a userspace SA client you want. > > > > > > > > -- Hal > > > > > > > > > > > > > > See, to execute osmtest, I found that openSM instance must be there. > > > > > > > Must be where ? What is your IB configuration ? > > > > > > > So, even if I use part > > > of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I have to > > > start openSM > > > instance to execute the SA query successfully. > > > > > > > An SM is needed in the subnet and SA is part of that and answers such > > queries. > > > > > > > Without starting openSM client, I m able to retrieve node description, > > > node info, SM info, > > > port info by using management libraries libibumad and libibmad. > > > > > > > of the local node only (until the SM brings up the subnet). > > > > > > > What I want to achieve is, without talking with openSM instance, my SA > > > query client > > > should go and get the required information. > > > > > > > Why ? > > > > > > > Is this possible ?. > > > > > > > No. What would you query for paths to if the subnet were not up ? > > > > -- Hal > > > > > > > Would like to know your inputs on this. > > > > > > Regards, > > > - Takshak > > > > > > > > Regards. > > > > > - Takshak > > > > > > > > > > > > > > > Hal Rosenstock wrote: > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > There are a couple of issues with the below. > > > > > > > > > > > > 1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. > > > > > > > > > > > > 2. I will assume your register call sets RMPP. > > > > > > > > > > > > 3. SA class version is 2. > > > > > > > > > > > > What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). 
> > > > > > > > > > > > -- Hal > > > > > > > > > > > > ________________________________ > > > > > > > > > > > > From: openib-general-bounces at openib.org on behalf of Takshak C. > > > > > > Sent: Mon 2/6/2006 8:00 AM > > > > > > To: openib-general at openib.org > > > > > > Subject: [openib-general] Get Table Records for SA Attribute ID ? > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > I m trying to get the table records for SA attribute ID in following way. > > > > > > But, I m not getting a single record, could anyone comment on the problem. > > > > > > > > > > > > 1. I have created saMadFormat structure described in the specification as below: > > > > > > > > > > > > struct saMadFormat > > > > > > { > > > > > > > > > > > > uint8_t base_version ; > > > > > > uint8_t mgmt_class ; > > > > > > uint8_t class_version ; > > > > > > uint8_t sa_method ; > > > > > > uint16_t status ; > > > > > > uint16_t not_used ; > > > > > > uint64_t tid ; > > > > > > uint16_t attr_id ; > > > > > > uint16_t resv ; > > > > > > uint32_t attr_mod ; > > > > > > uint64_t sa_key; > > > > > > uint64_t sm_key ; > > > > > > uint32_t seg_num ; > > > > > > uint32_t payload_len ; > > > > > > uint8_t frag_flag ; > > > > > > uint8_t edit_mod ; > > > > > > uint16_t window ; > > > > > > uint32_t endRID ; > > > > > > uint64_t comp_mask ; > > > > > > uint8_t adminData[192] ; > > > > > > }; > > > > > > > > > > > > 2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS > > > > > > and umad_open_port etc successfully. > > > > > > > > > > > > 3. 
struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); > > > > > > memset(saQuery, 0, sizeof(*saQuery)); > > > > > > > > > > > > saQuery->base_version = 1; > > > > > > saQuery->mgmt_class = IB_SA_CLASS ; > > > > > > saQuery->class_version = 1 ; > > > > > > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; > > > > > > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; > > > > > > saQuery->attr_mod = 0 ; > > > > > > saQuery->tid = htonll(drmad_tid++); > > > > > > saQuery->endRID = 0 ; > > > > > > > > > > > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); > > > > > > umad_set_grh(umad, 0); > > > > > > umad_set_pkey(umad, 0xFFFF); > > > > > > > > > > > > 4. length = IB_MAD_SIZE; > > > > > > > > > > > > if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) > > > > > > IBPANIC("send failed"); > > > > > > > > > > > > if (umad_recv(portid, umad, &length, -1) != mad_agent) > > > > > > IBPANIC("recv error: %s", drmad_status_str(saQuery)); > > > > > > > > > > > > > > > > > > > > > > > > if (!dump_char) { > > > > > > xdump(stdout, 0, saQuery->adminData, 192); > > > > > > return 0; > > > > > > } > > > > > > > > > > > > I m expecting that, I will get the resultant data in saQuery->adminData. > > > > > > Is this correct ? If not then, how should I retrieve the table records ? > > > > > > Any Idea ? 
> > > > > >
> > > > > > Thanks
> > > > > > - Takshak
> > > > > >
> > > > > > _______________________________________________
> > > > > > openib-general mailing list
> > > > > > openib-general at openib.org
> > > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > >
> > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From ogerlitz at voltaire.com Wed Feb 22 06:22:23 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 22 Feb 2006 16:22:23 +0200 (IST)
Subject: [openib-general] [PATCH 0/6] [RFC] iSER initiator
Message-ID:

The patch series that follows is sent to the openib community asking for
comments on the iSER (iSCSI Extensions for RDMA) code. The end goal is to
go upstream, so we would like to get as much feedback as possible.

This driver is an iSER transport implementation for the Open iSCSI
initiator (www.open-iscsi.org), whose kernel portion and TCP provider are
merged in as of 2.6.15 (drivers/scsi/scsi_transport_iscsi.c & iscsi_tcp.c).

Hence iSER is both a provider of the Linux iSCSI transport API (defined in
scsi/scsi_transport_iscsi.h) and a SCSI LLD (Low Level Driver) of the Linux
SCSI midlayer API (defined in scsi/scsi_host.h).

More information and TODO items are available in the iSER section of the
openib wiki. This RFC is posted knowing that the TODO list is not finished;
I've put specific comments and known issues at the head of some of the
patches.

In Open iSCSI, target discovery, connection and login to a target are
carried out from user space; once the login negotiation is done, the user
space connection is bound to a kernel connection. The diagram at
http://www.open-iscsi.org/docs/open-iscsi-1.jpg shows the connection
sequence.

The transport is expected to use a socket for the connection, since Linux
has the means to move a socket from user to kernel space.
Under this restriction, and given the inability to move a QP from user to
kernel space, we had to use an iser socket family, where the socket actually
maps to struct iser_conn, which contains the IB connection CMA ID and QP.

Basically, it goes like:

+1 target discovery over TCP/IP with the discovery server
+2 socket create/bind/setopt/connect to the target
+3 iscsi session create
+4 iscsi connection create
+5 bind iscsi connection to the socket
+6 login request/response negotiation
+7 iscsi connection start
+8 the SCSI midlayer starts its inquiry and so on

The code has been tested with 2.6.15, the latest openib (r5460) and the user
part of the open-iscsi-0.4-434 release. With the 434 release, the only patch
for open-iscsi needed to support iSER is linux-2.6.15-iscsi_iser.diff under
https://openib.org/svn/gen2/trunk/src/linux-kernel/patches.

From ogerlitz at voltaire.com Wed Feb 22 06:25:49 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 22 Feb 2006 16:25:49 +0200 (IST)
Subject: [openib-general] [PATCH 1/6] [RFC] iscsi_iser header file
In-Reply-To:
Message-ID:

+ some of the defines here replicate those in drivers/scsi/iscsi_tcp.h,
  so merging them is possible

+ a cleanup we plan is to reduce the usage of the iser dbg/err/bug macros
  and convert the remaining iser_bug calls into standard BUG() calls.

+ the wire structures are iser_hdr (below) and various iscsi_hdr
  (from scsi/iscsi_proto.h)

--- /ulp/iser-x/iscsi_iser.h 2006-02-22 15:06:45.000000000 +0200
+++ /ulp/iser/iscsi_iser.h 2006-02-22 13:48:55.000000000 +0200
@@ -1 +1,501 @@
+/*
+ * iSER transport for the Open iSCSI Initiator & iSER transport internals
+ *
+ * Copyright (C) 2004 Dmitry Yusupov
+ * Copyright (C) 2004 Alex Aizman
+ * Copyright (C) 2005 Mike Christie
+ * based on code maintained by open-iscsi at googlegroups.com
+ *
+ * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved.
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iscsi_iser.h 5459 2006-02-22 11:00:48Z ogerlitz $ + */ +#ifndef __ISCSI_ISER_H__ +#define __ISCSI_ISER_H__ +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +/* XXX remove this compatibility hack when 2.6.16 is released */ +#include +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,16) +#include +#else +#include +#endif /* XXX end of hack */ + +#include +#include + +#include +#include +#include + +#include +#include +#include + +#define PFX "iser:" + +#define iser_dbg(fmt, arg...) 
\ + do { \ + if (iser_debug_level > 0) \ + printk(KERN_DEBUG PFX "%s:" fmt,\ + __func__ , ## arg); \ + } while (0) + +#define iser_err(fmt, arg...) \ + do { \ + printk(KERN_ERR PFX "%s:" fmt, \ + __func__ , ## arg); \ + } while (0) + +#define iser_bug(fmt,arg...) \ + do { \ + printk(KERN_ERR PFX "%s: PANIC! " fmt, \ + __func__ , ## arg); \ + BUG(); \ + } while(0) + +#define ISCSI_ISER_XMIT_CMDS_MAX 128 /* must be power of 2 */ +#define ISCSI_ISER_MGMT_CMDS_MAX 32 /* must be power of 2 */ + /* support upto 512KB in one RDMA */ +#define ISCSI_ISER_SG_TABLESIZE (0x80000 >> PAGE_SHIFT) +#define ISCSI_ISER_CMD_PER_LUN ISCSI_ISER_XMIT_CMDS_MAX +#define ISCSI_ISER_MAX_LUN 256 +#define ISCSI_ISER_MAX_CMD_LEN 16 + +#define ISCSI_MGMT_ITT_OFFSET 0xa00 + +/* Session's states */ +#define ISCSI_STATE_FREE 1 +#define ISCSI_STATE_LOGGED_IN 2 +#define ISCSI_STATE_FAILED 3 +#define ISCSI_STATE_TERMINATE 4 + +/* Connection's states */ +#define ISCSI_CONN_INITIAL_STAGE 0 +#define ISCSI_CONN_STARTED 1 +#define ISCSI_CONN_STOPPED 2 +#define ISCSI_CONN_CLEANUP_WAIT 3 + +/* Connection suspend "bit" */ +#define SUSPEND_BIT 1 + +/* Task Mgmt states */ +#define TMABORT_INITIAL 0x0 +#define TMABORT_SUCCESS 0x1 +#define TMABORT_FAILED 0x2 +#define TMABORT_TIMEDOUT 0x3 + +#define ITT_MASK (0xfff) +#define CID_SHIFT 12 +#define CID_MASK (0xffff< data scatterlist -> sg */ + unsigned int size; /* data len for single, nentries for sg */ + dma_addr_t dma_addr; /* returned by dma_map_single */ + unsigned int dma_nents; /* returned by dma_map_sg for */ +}; + +/* fwd declarations */ +struct iser_adaptor; +struct iscsi_iser_conn; +struct iscsi_iser_session; +struct iscsi_iser_mgmt_task; +struct iscsi_iser_cmd_task; + +struct iser_mem_reg { + u32 lkey; + u32 rkey; + u64 va; + u64 len; + void *mem_h; +}; + +struct iser_regd_buf { + struct iser_mem_reg reg; /* memory registration info */ + void *virt_addr; + struct iser_adaptor *p_adaptor; /* p_adaptor->device for dma_unmap */ + dma_addr_t dma_addr; /* if 
non zero, addr for dma_unmap */ + enum dma_data_direction direction; /* direction for dma_unmap */ + unsigned int data_size; + atomic_t ref_count; /* refcount, freed when dec to 0 */ +}; + +#define MAX_REGD_BUF_VECTOR_LEN 2 + +struct iser_dto { + struct iscsi_iser_cmd_task *p_task; + struct iscsi_iser_conn *p_conn; + int notify_enable; + + /* vector of registered buffers */ + unsigned int regd_vector_len; + struct iser_regd_buf *regd[MAX_REGD_BUF_VECTOR_LEN]; + + /* offset into the registered buffer may be specified */ + unsigned int offset[MAX_REGD_BUF_VECTOR_LEN]; + + /* a smaller size may be specified, if 0, then full size is used */ + unsigned int used_sz[MAX_REGD_BUF_VECTOR_LEN]; +}; + +enum iser_desc_type { + ISCSI_RX, + ISCSI_TX_CONTROL , + ISCSI_TX_SCSI_COMMAND, + ISCSI_TX_DATAOUT +}; + +struct iser_desc { + struct iser_hdr iser_header; + struct iscsi_hdr iscsi_header; + struct iser_regd_buf hdr_regd_buf; + void *data; /* used by RX & TX_CONTROL */ + struct iser_regd_buf data_regd_buf; /* used by RX & TX_CONTROL */ + enum iser_desc_type type; + struct iser_dto dto; +}; + +struct iser_adaptor { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *cq; + struct ib_mr *mr; + struct tasklet_struct cq_tasklet; + struct list_head ig_list; /* entry in ig adaptors list */ + int refcount; + char name[ISER_OBJECT_NAME_SIZE]; +}; + +struct iser_conn +{ + atomic_t state; /* rdma connection state */ + struct iser_adaptor *p_adaptor; /* adaptor context */ + struct rdma_cm_id *cma_id; /* CMA ID */ + struct ib_qp *qp; /* QP */ + struct ib_fmr_pool *fmr_pool; /* pool of IB FMRs */ + int disc_evt_flag; /* disconn event delivered */ + wait_queue_head_t wait; /* waitq for conn/disconn */ + struct iscsi_iser_conn *p_iscsi_conn; /* iscsi conn for upcalls */ + atomic_t post_recv_buf_count; /* posted rx count */ + atomic_t post_send_buf_count; /* posted tx count */ + struct work_struct comperror_work; /* conn term sleepable ctx*/ + char name[ISER_OBJECT_NAME_SIZE]; +}; + 
+struct iscsi_iser_queue { + struct kfifo *queue; /* FIFO Queue */ + void **pool; /* Pool of elements */ + int max; /* Max number of elements */ +}; + +struct iscsi_iser_conn { + struct socket *sock; /* iSER socket */ + struct iser_conn *ib_conn; /* iSER IB conn */ + int stop_stage; /* conn_stop() flag: * + * stop to recover, * + * stop to terminate */ + /* iSCSI connection-wide sequencing */ + uint32_t exp_statsn; + + /* control data */ + int id; /* CID */ + struct iscsi_iser_session *session; /* parent session */ + struct list_head item; /* maintains list of conns */ + int c_stage; /* connection state */ + struct iscsi_iser_mgmt_task *login_mtask;/* mtask for login/text */ + struct iscsi_iser_mgmt_task *mtask; /* xmit mtask in progress */ + struct iscsi_iser_cmd_task *ctask; /* xmit ctask in progress */ + + /* xmit */ + struct kfifo *immqueue; /* immediate xmit queue */ + struct kfifo *mgmtqueue; /* mgmt (control) xmit queue */ + struct kfifo *xmitqueue; /* data-path cmd queue */ + struct work_struct xmitwork; /* per-conn. 
xmit workqueue */ + struct mutex xmitmutex; /* serializes connection xmit, + * access to kfifos: * + * xmitqueue, * + * immqueue, mgmtqueue */ + unsigned long suspend_tx; /* suspend Tx */ + spinlock_t lock; + + /* abort */ + wait_queue_head_t ehwait; /* used in eh_abort() */ + struct iscsi_tm tmhdr; + struct timer_list tmabort_timer; /* abort timer */ + int tmabort_state; /* see TMABORT_INITIAL,etc.*/ + + /* negotiated params */ + int max_recv_dlength; /* initiator_max_recv_dsl*/ + int max_xmit_dlength; /* target_max_recv_dsl */ +}; + +struct iscsi_iser_session { + /* iSCSI session-wide sequencing */ + uint32_t cmdsn; + uint32_t exp_cmdsn; + uint32_t max_cmdsn; + + /* configuration */ + int initial_r2t_en; + int max_r2t; + int imm_data_en; + int first_burst; + int max_burst; + int time2wait; + int time2retain; + int pdu_inorder_en; + int dataseq_inorder_en; + int erl; + + /* control data */ + struct Scsi_Host *host; + int id; + struct iscsi_iser_conn *leadconn; /* leading connection */ + spinlock_t lock; /* protects session state, * + * sequence numbers, * + * session resources: * + * - cmdpool, * + * - mgmtpool, */ + volatile int state; /* session state */ + int conn_cnt; + int age; /* counts session re-opens */ + + struct list_head connections; /* list of connections */ + int cmds_max; /* size of cmds array */ + struct iscsi_iser_cmd_task **cmds; /* Original Cmds arr */ + struct iscsi_iser_queue cmdpool; /* PDU's pool */ + int mgmtpool_max;/* size of mgmt array */ + struct iscsi_iser_mgmt_task **mgmt_cmds; /* Original mgmt arr */ + struct iscsi_iser_queue mgmtpool; /* Mgmt PDU's pool */ +}; + +struct iscsi_iser_mgmt_task { + struct iser_desc desc; + struct iscsi_hdr *hdr; /* mgmt. 
PDU header points * + * to desc.iscsi_hdr */ + char *data; /* mgmt payload points to * + * desc.data */ + int data_count; /* counts data to be sent */ + uint32_t itt; /* this ITT */ +}; + +struct iscsi_iser_cmd_task { + struct iser_desc desc; + + struct iscsi_cmd *hdr; /* iSCSI PDU header points * + * to desc.iscsi_hdr */ + int itt; /* this ITT */ + int datasn; /* DataSN */ + uint32_t unsol_datasn; + int imm_count; /* imm-data (bytes) */ + int unsol_count; /* unsolicited (bytes) */ + int data_count; /* remaining Data-Out */ + int rdma_data_count;/* RDMA bytes */ + struct scsi_cmnd *sc; /* associated SCSI cmd */ + int total_length; + struct iscsi_iser_conn *conn; /* used connection */ + struct iscsi_iser_mgmt_task *mtask; /* tmf mtask in progr */ + + enum iser_task_status status; + int command_sent; /* set if command sent */ + int dir[ISER_DIRS_NUM]; /* set if dir use*/ + struct iser_regd_buf rdma_regd[ISER_DIRS_NUM];/* regd rdma buf */ + unsigned long data_len[ISER_DIRS_NUM]; /* total data len*/ + struct iser_data_buf data[ISER_DIRS_NUM]; /* orig. data des*/ + struct iser_data_buf data_copy[ISER_DIRS_NUM];/* contig. 
copy */ +}; + +struct iser_page_vec { + u64 *pages; + int length; + int offset; + int data_size; +}; + +struct iser_global { + struct mutex adaptor_list_mutex;/* */ + struct list_head adaptor_list; /* all iSER adaptors */ + + kmem_cache_t *desc_cache; +}; + +extern struct iser_global ig; +extern int iser_debug_level; + +/* allocate connection resources needed for rdma functionality */ +int iser_conn_set_full_featured_mode(struct iscsi_iser_conn *p_iser_conn); + +int iser_send_control(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_mgmt_task *p_mtask); + +int iser_send_command(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_cmd_task *p_ctask); + +int iser_send_data_out(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_cmd_task *p_ctask, + struct iscsi_data *hdr); + +int iscsi_iser_hdr_recv(struct iscsi_iser_conn *conn, struct iscsi_hdr *hdr, + char *rx_data); + +void iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err); + + +void iser_conn_init(struct iser_conn *p_iser_conn); + +int iser_conn_establish(struct iser_conn *p_iser_conn, + struct sockaddr_in *dst_addr, + struct sockaddr_in *src_addr); + +void iser_conn_terminate(struct iser_conn *ib_conn); + +void iser_rcv_completion(struct iser_desc *p_desc, + unsigned long dto_xfer_len); + +void iser_snd_completion(struct iser_desc *p_desc); + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *p_iser_task); +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *p_iser_task); + +void iser_dto_buffs_release(struct iser_dto *p_dto); + +int iser_regd_buff_release(struct iser_regd_buf *p_regd_buf); + +void iser_reg_single(struct iser_adaptor *p_iser_adaptor, + struct iser_regd_buf *p_regd_buf, + enum dma_data_direction direction); + +int iser_sg_size(struct iser_data_buf *p_mem); + +void iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task, + enum iser_data_dir cmd_dir); + +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task); + 
+int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *p_iser_task, + enum iser_data_dir cmd_dir); + +int iser_connect(struct iser_conn *p_iser_conn, + struct sockaddr_in *src_addr, struct sockaddr_in *dst_addr); + +int iser_reg_page_vec(struct iser_conn *p_iser_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *p_mem_reg); + +void iser_unreg_mem(struct iser_mem_reg *mem_reg); + +int iser_post_recv(struct iser_desc *p_rx_desc); +int iser_post_send(struct iser_desc *p_tx_desc); +#endif From ogerlitz at voltaire.com Wed Feb 22 06:27:29 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 22 Feb 2006 16:27:29 +0200 (IST) Subject: [openib-general] [PATCH 2/6] [RFC] open iscsi iser transport provider code In-Reply-To: Message-ID: + this is the equivalent of drivers/scsi/iscsi_tcp.c + As of the trunk conventions of being workable with the latest stable kernel (2.6.15) iscsi_iser.c is not uptodate with iscsi_tcp.c which passed many changes towards 2.6.16. The sync that would take place once 2.6.16 is out, will remove much of the duplications. --- /ulp/iser-x/iscsi_iser.c 2006-02-22 15:06:49.000000000 +0200 +++ /ulp/iser/iscsi_iser.c 2006-02-22 15:14:42.000000000 +0200 @@ -1 +1,1850 @@ +/* + * iSCSI Initiator over iSER Data-Path + * + * Copyright (C) 2004 Dmitry Yusupov + * Copyright (C) 2004 Alex Aizman + * Copyright (C) 2005 Mike Christie + * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. + * maintained by openib-general at openib.org + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Credits: + * Christoph Hellwig + * FUJITA Tomonori + * Arne Redlich + * Zhenyu Wang + * Modified by: + * Erez Zilber + * + * + * $Id: iscsi_iser.c 5460 2006-02-22 11:25:08Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" +#include "iser_socket.h" + +#define INVALID_SN_DELTA 0xffff + +#ifdef DEBUG_ISER +#define debug_iser(fmt...) printk(KERN_DEBUG "iser: " fmt) +#else +#define debug_iser(fmt...) +#endif + +#ifdef DEBUG_SCSI +#define debug_scsi(fmt...) 
printk(KERN_DEBUG "scsi: " fmt) +#else +#define debug_scsi(fmt...) +#endif + +static unsigned int iscsi_max_lun = 512; +module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); + +#define DRV_VER "$Rev$" +#define DRV_DATE "$LastChangedDate$" + +int iser_debug_level = 0; + +MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " + "v" DRV_VER "(" DRV_DATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Alex Nezhinsky, Dan Bar Dov"); + +module_param_named(debug_level, iser_debug_level, int, 0644); +MODULE_PARM_DESC(debug_level,"Enable debug tracing if > 0 (default:disabled)"); + +struct iser_global ig; + +void +iscsi_iser_conn_failure(struct iscsi_iser_conn *conn, enum iscsi_err err) +{ + struct iscsi_iser_session *session = conn->session; + unsigned long flags; + + spin_lock_irqsave(&session->lock, flags); + if (session->conn_cnt == 1 || session->leadconn == conn) + session->state = ISCSI_STATE_FAILED; + spin_unlock_irqrestore(&session->lock, flags); + set_bit(SUSPEND_BIT, &conn->suspend_tx); + iscsi_conn_error(iscsi_handle(conn), err); +} + +static inline int +iscsi_iser_check_assign_cmdsn(struct iscsi_iser_session *session, + struct iscsi_nopin *hdr) +{ + uint32_t max_cmdsn = be32_to_cpu(hdr->max_cmdsn); + uint32_t exp_cmdsn = be32_to_cpu(hdr->exp_cmdsn); + + if (max_cmdsn < exp_cmdsn -1 && + max_cmdsn > exp_cmdsn - INVALID_SN_DELTA) + return ISCSI_ERR_MAX_CMDSN; + if (max_cmdsn > session->max_cmdsn || + max_cmdsn < session->max_cmdsn - INVALID_SN_DELTA) + session->max_cmdsn = max_cmdsn; + if (exp_cmdsn > session->exp_cmdsn || + exp_cmdsn < session->exp_cmdsn - INVALID_SN_DELTA) + session->exp_cmdsn = exp_cmdsn; + + return 0; +} + +static inline void +iscsi_iser_ctask_cleanup(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask) +{ + struct scsi_cmnd *sc = ctask->sc; + struct iscsi_iser_session *session = conn->session; + + spin_lock(&session->lock); + if (unlikely(!sc)) { + spin_unlock(&session->lock); + return; + } + ctask->sc = 
NULL; + __kfifo_put(session->cmdpool.queue, (void*)&ctask, sizeof(void*)); + spin_unlock(&session->lock); +} + +/** + * iscsi_cmd_rsp - SCSI Command Response processing + * @conn: iscsi connection + * @ctask: scsi command task + **/ +static int +iscsi_iser_cmd_rsp(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask, + struct iscsi_hdr *hdr, char *rx_data) +{ + int rc; + struct iscsi_cmd_rsp *rhdr = (struct iscsi_cmd_rsp *)hdr; + struct iscsi_iser_session *session = conn->session; + struct scsi_cmnd *sc = ctask->sc; + int senselen = 0; + char *data = NULL; + + rc = iscsi_iser_check_assign_cmdsn(session, (struct iscsi_nopin*)rhdr); + if (rc) { + sc->result = (DID_ERROR << 16); + goto out; + } + + conn->exp_statsn = be32_to_cpu(rhdr->statsn) + 1; + + sc->result = (DID_OK << 16) | rhdr->cmd_status; + + if (rhdr->response != ISCSI_STATUS_CMD_COMPLETED) { + sc->result = (DID_ERROR << 16); + goto out; + } + + if (ntoh24(rhdr->dlength)) { + data = rx_data; + senselen = (data[0] << 8) | data[1]; + } + + if (rhdr->cmd_status == SAM_STAT_CHECK_CONDITION && senselen) { + int sensecopy = min(senselen, SCSI_SENSE_BUFFERSIZE); + + memcpy(sc->sense_buffer, data + 2, sensecopy); + debug_scsi("copied %d bytes of sense\n", sensecopy); + } + + if (sc->sc_data_direction == DMA_TO_DEVICE) + goto out; + + if (rhdr->flags & ISCSI_FLAG_CMD_UNDERFLOW) { + int res_count = be32_to_cpu(rhdr->residual_count); + + if (res_count > 0 && res_count <= sc->request_bufflen) + sc->resid = res_count; + else + sc->result = (DID_BAD_TARGET << 16) | rhdr->cmd_status; + } else if (rhdr->flags & ISCSI_FLAG_CMD_BIDI_UNDERFLOW) + sc->result = (DID_BAD_TARGET << 16) | rhdr->cmd_status; + else if (rhdr->flags & ISCSI_FLAG_CMD_OVERFLOW) + sc->resid = be32_to_cpu(rhdr->residual_count); + +out: + debug_scsi("done [sc %lx res %d itt 0x%x]\n", + (long)sc, sc->result, ctask->itt); + + iscsi_iser_ctask_cleanup(conn, ctask); + sc->scsi_done(sc); + return rc; +} + +int +iscsi_iser_hdr_recv(struct 
iscsi_iser_conn *conn, struct iscsi_hdr *hdr, char *rx_data) +{ + int rc = 0; + struct iscsi_iser_cmd_task *ctask; + struct iscsi_iser_session *session = conn->session; + uint32_t itt; + int datalen; + int ahslen; + + /* verify PDU length */ + datalen = ntoh24(hdr->dlength); + if (datalen > DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH) { + printk(KERN_ERR "iscsi_tcp: datalen %d > %d\n", + datalen, DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH); + return ISCSI_ERR_DATALEN; + } + + /* read AHS */ + ahslen = hdr->hlength * 4; + + /* verify itt (itt encoding: age+cid+itt) */ + itt = hdr->itt; + if (itt != cpu_to_be32(ISCSI_RESERVED_TAG)) { + if ((itt & AGE_MASK) != + (session->age << AGE_SHIFT)) { + printk(KERN_ERR "iscsi_iser: received itt %x expected " + "session age (%x)\n", itt, + session->age & AGE_MASK); + return ISCSI_ERR_BAD_ITT; + } + + if ((itt & CID_MASK) != (conn->id << CID_SHIFT)) { + printk(KERN_ERR "iscsi_iser: received itt %x, expected " + "CID (%x)\n", itt, conn->id); + return ISCSI_ERR_BAD_ITT; + } + itt = itt & ITT_MASK; + } else + itt = itt; + + if (itt < session->cmds_max) { + ctask = (struct iscsi_iser_cmd_task *)session->cmds[itt]; + + if (!ctask->sc) { + printk(KERN_INFO "iscsi_iser: dropping ctask with " + "itt 0x%x\n", ctask->itt); + return 0; + } + + if (ctask->sc->SCp.phase != session->age) { + printk(KERN_ERR "iscsi_iser: ctask's session age %d, " + "expected %d\n", ctask->sc->SCp.phase, + session->age); + return ISCSI_ERR_SESSION_FAILED; + } + + debug_scsi("rsp [op 0x%x cid %d sc %lx itt 0x%x len %d]\n", + hdr->opcode, conn->id, (long)ctask->sc, + ctask->itt, datalen); + + switch(hdr->opcode) { + case ISCSI_OP_SCSI_CMD_RSP: + BUG_ON((void*)ctask != ctask->sc->SCp.ptr); + if (!datalen) + rc = iscsi_iser_cmd_rsp(conn, ctask, hdr, rx_data); + break; + + default: + rc = ISCSI_ERR_BAD_OPCODE; + break; + } + } else if (itt >= ISCSI_MGMT_ITT_OFFSET && + itt < ISCSI_MGMT_ITT_OFFSET + session->mgmtpool_max) { + struct iscsi_iser_mgmt_task *mtask = (struct 
iscsi_iser_mgmt_task *) + session->mgmt_cmds[itt - + ISCSI_MGMT_ITT_OFFSET]; + + debug_scsi("immrsp [op 0x%x cid %d itt 0x%x len %d]\n", + hdr->opcode, conn->id, mtask->itt, + datalen); + + switch(hdr->opcode) { + case ISCSI_OP_LOGIN_RSP: + case ISCSI_OP_TEXT_RSP: + case ISCSI_OP_LOGOUT_RSP: + rc = iscsi_iser_check_assign_cmdsn(session, + (struct iscsi_nopin*)hdr); + if (rc) + break; + + rc = iscsi_recv_pdu(iscsi_handle(conn), hdr, + NULL, 0); + if (conn->login_mtask != mtask) { + spin_lock(&session->lock); + __kfifo_put(session->mgmtpool.queue, + (void*)&mtask, sizeof(void*)); + spin_unlock(&session->lock); + } + break; + case ISCSI_OP_SCSI_TMFUNC_RSP: + rc = iscsi_iser_check_assign_cmdsn(session, + (struct iscsi_nopin*)hdr); + if (rc) + break; + + if (datalen || ahslen) { + rc = ISCSI_ERR_PROTO; + break; + } + + spin_lock(&session->lock); + if (conn->tmabort_state == TMABORT_INITIAL) { + __kfifo_put(session->mgmtpool.queue, + (void*)&mtask, sizeof(void*)); + conn->tmabort_state = + ((struct iscsi_tm_rsp *)hdr)-> + response == ISCSI_TMF_RSP_COMPLETE ? 
+ TMABORT_SUCCESS:TMABORT_FAILED; + /* unblock eh_abort() */ + wake_up(&conn->ehwait); + } + spin_unlock(&session->lock); + break; + case ISCSI_OP_NOOP_IN: + if (hdr->ttt != ISCSI_RESERVED_TAG) { + rc = ISCSI_ERR_PROTO; + break; + } + rc = iscsi_iser_check_assign_cmdsn(session, + (struct iscsi_nopin*)hdr); + if (rc) + break; + conn->exp_statsn = be32_to_cpu(hdr->statsn) + 1; + + rc = iscsi_recv_pdu(iscsi_handle(conn), hdr, + NULL, 0); + mtask = (struct iscsi_iser_mgmt_task *) + session->mgmt_cmds[itt - + ISCSI_MGMT_ITT_OFFSET]; + if (conn->login_mtask != mtask) { + spin_lock(&session->lock); + __kfifo_put(session->mgmtpool.queue, + (void*)&mtask, sizeof(void*)); + spin_unlock(&session->lock); + } + break; + default: + rc = ISCSI_ERR_BAD_OPCODE; + break; + } + } else if (itt == ISCSI_RESERVED_TAG) { + switch(hdr->opcode) { + case ISCSI_OP_NOOP_IN: + rc = iscsi_iser_check_assign_cmdsn(session, + (struct iscsi_nopin*)hdr); + if (!rc && hdr->ttt != ISCSI_RESERVED_TAG) + rc = iscsi_recv_pdu(iscsi_handle(conn), + hdr, NULL, 0); + break; + case ISCSI_OP_REJECT: + /* we need sth like iscsi_reject_rsp()*/ + case ISCSI_OP_ASYNC_EVENT: + /* we need sth like iscsi_async_event_rsp() */ + rc = ISCSI_ERR_BAD_OPCODE; + break; + default: + rc = ISCSI_ERR_BAD_OPCODE; + break; + } + } else + rc = ISCSI_ERR_BAD_ITT; + + return rc; +} + +static void +iscsi_iser_unsolicit_data_init(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask, + struct iscsi_data *hdr) +{ + memset(hdr, 0, sizeof(struct iscsi_data)); + hdr->ttt = cpu_to_be32(ISCSI_RESERVED_TAG); + hdr->datasn = cpu_to_be32(ctask->unsol_datasn); + ctask->unsol_datasn++; + hdr->opcode = ISCSI_OP_SCSI_DATA_OUT; + memcpy(hdr->lun, ctask->hdr->lun, sizeof(hdr->lun)); + + hdr->itt = ctask->hdr->itt; + hdr->exp_statsn = cpu_to_be32(conn->exp_statsn); + + hdr->offset = cpu_to_be32(ctask->total_length - + ctask->rdma_data_count - + ctask->unsol_count); + + if (ctask->unsol_count > conn->max_xmit_dlength) { + 
hton24(hdr->dlength, conn->max_xmit_dlength); + ctask->data_count = conn->max_xmit_dlength; + hdr->flags = 0; + } else { + hton24(hdr->dlength, ctask->unsol_count); + ctask->data_count = ctask->unsol_count; + hdr->flags = ISCSI_FLAG_CMD_FINAL; + } +} + + +/** + * iscsi_iser_cmd_init - Initialize iSCSI SCSI_READ or SCSI_WRITE commands + * + **/ +static void +iscsi_iser_cmd_init(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask, + struct scsi_cmnd *sc) +{ + struct iscsi_iser_session *session = conn->session; + + ctask->sc = sc; + ctask->conn = conn; + + ctask->hdr->opcode = ISCSI_OP_SCSI_CMD; + ctask->hdr->flags = ISCSI_ATTR_SIMPLE; + ctask->hdr->lun[1] = sc->device->lun; + ctask->hdr->itt = ctask->itt | (conn->id << CID_SHIFT) | + (session->age << AGE_SHIFT); + ctask->hdr->data_length = cpu_to_be32(sc->request_bufflen); + ctask->hdr->cmdsn = cpu_to_be32(session->cmdsn); session->cmdsn++; + ctask->hdr->exp_statsn = cpu_to_be32(conn->exp_statsn); + memcpy(ctask->hdr->cdb, sc->cmnd, sc->cmd_len); + memset(&ctask->hdr->cdb[sc->cmd_len], 0, + MAX_COMMAND_SIZE - sc->cmd_len); + + ctask->mtask = NULL; + ctask->command_sent = 0; + + ctask->total_length = sc->request_bufflen; + + if (sc->sc_data_direction == DMA_TO_DEVICE) { + ctask->hdr->flags |= ISCSI_FLAG_CMD_WRITE; + BUG_ON(ctask->total_length == 0); + + /* unsolicited bytes to be sent as imm. 
data - with cmd pdu */ + ctask->imm_count = 0; + /* unsolicited bytes to be sent as data-out */ + ctask->unsol_count = 0; + ctask->unsol_datasn = 0; + + if (session->imm_data_en) { + if (ctask->total_length >= session->first_burst) + ctask->imm_count = min(session->first_burst, + conn->max_xmit_dlength); + else + ctask->imm_count = min(ctask->total_length, + conn->max_xmit_dlength); + hton24(ctask->hdr->dlength, ctask->imm_count); + } else + zero_data(ctask->hdr->dlength); + + if (!session->initial_r2t_en) + ctask->unsol_count = min(session->first_burst, + ctask->total_length) - ctask->imm_count; + if (!ctask->unsol_count) + /* No unsolicit Data-Out's */ + ctask->hdr->flags |= ISCSI_FLAG_CMD_FINAL; + + /* bytes to be sent via RDMA operations */ + ctask->rdma_data_count = ctask->total_length - + ctask->imm_count - + ctask->unsol_count; + + debug_scsi("cmd [itt %x total %d imm %d imm_data %d " + "rdma_data %d]\n", + ctask->itt, ctask->total_length, ctask->imm_count, + ctask->unsol_count, ctask->rdma_data_count); + } else { + ctask->hdr->flags |= ISCSI_FLAG_CMD_FINAL; + if (sc->sc_data_direction == DMA_FROM_DEVICE) + ctask->hdr->flags |= ISCSI_FLAG_CMD_READ; + ctask->datasn = 0; + zero_data(ctask->hdr->dlength); + ctask->rdma_data_count = ctask->total_length; + } + + iser_ctask_rdma_init(ctask); +} + +/** + * iscsi_mtask_xmit - xmit management(immediate) task + * @conn: iscsi connection + * @mtask: task management task + * + * Notes: + * The function can return -EAGAIN in which case caller must + * call it again later, or recover. '0' return code means successful + * xmit. 
+ * + **/ +static int +iscsi_iser_mtask_xmit(struct iscsi_iser_conn *conn, + struct iscsi_iser_mgmt_task *mtask) +{ + int error = 0; + + debug_scsi("mtask deq [cid %d itt 0x%x]\n", conn->id, mtask->itt); + + error = iser_send_control(conn, mtask); + + if (error && error != -EAGAIN) + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + + return error; +} + +static int +iscsi_iser_ctask_xmit_unsol_data(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask) +{ + struct iscsi_data hdr; + int error = 0; + + /* Send data-out PDUs while there's still unsolicited data to send */ + while (ctask->unsol_count > 0) { + iscsi_iser_unsolicit_data_init(conn, ctask, &hdr); + + debug_scsi("Sending data-out: itt 0x%x, data count %d\n", + hdr.itt, ctask->data_count); + + /* the buffer description has been passed with the command */ + /* Send the command */ + error = iser_send_data_out(conn, ctask, &hdr); + if (error) { + ctask->unsol_datasn--; + goto iscsi_iser_ctask_xmit_unsol_data_exit; + } + ctask->unsol_count -= ctask->data_count; + debug_scsi("Need to send %d more as data-out PDUs\n", + ctask->unsol_count); + } + +iscsi_iser_ctask_xmit_unsol_data_exit: + return error; +} + +static int +iscsi_iser_ctask_xmit(struct iscsi_iser_conn *conn, + struct iscsi_iser_cmd_task *ctask) +{ + int error = 0; + + debug_scsi("ctask deq [cid %d itt 0x%x]\n", + conn->id, ctask->itt); + + /* + * serialize with TMF AbortTask + */ + if (ctask->mtask) + return error; + + /* Send the cmd PDU */ + if (!ctask->command_sent) { + error = iser_send_command(conn, ctask); + if (error) + goto iscsi_iser_ctask_xmit_exit; + ctask->command_sent = 1; + } + + /* Send unsolicited data-out PDU(s) if necessary */ + if (ctask->unsol_count) + error = iscsi_iser_ctask_xmit_unsol_data(conn, ctask); + + iscsi_iser_ctask_xmit_exit: + if (error && error != -EAGAIN) + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + return error; +} + +/** + * iscsi_data_xmit - xmit any command into the scheduled 
connection + * @conn: iscsi connection + * + * Notes: + * The function can return -EAGAIN in which case the caller must + * re-schedule it again later or recover. '0' return code means + * successful xmit. + **/ +static int +iscsi_iser_data_xmit(struct iscsi_iser_conn *conn) +{ + if (unlikely(conn->suspend_tx)) { + debug_iser("conn %d Tx suspended!\n", conn->id); + return 0; + } + + /* + * Transmit in the following order: + * + * 1) un-finished xmit (ctask or mtask) + * 2) immediate control PDUs + * 3) SCSI commands + * 4) non-immediate control PDUs + * + * No need to lock around __kfifo_get as long as + * there's one producer and one consumer. + */ + + BUG_ON(conn->ctask && conn->mtask); + + if (conn->ctask) { + if (iscsi_iser_ctask_xmit(conn, conn->ctask)) + goto iscsi_iser_data_xmit_fail; + /* done with this in-progress ctask */ + conn->ctask = NULL; + } + if (conn->mtask) { + if (iscsi_iser_mtask_xmit(conn, conn->mtask)) + goto iscsi_iser_data_xmit_fail; + /* done with this in-progress mtask */ + conn->mtask = NULL; + } + + /* process immediate first */ + if (unlikely(__kfifo_len(conn->immqueue))) { + struct iscsi_iser_session *session = conn->session; + while (__kfifo_get(conn->immqueue, (void*)&conn->mtask, + sizeof(void*))) { + if (iscsi_iser_mtask_xmit(conn, conn->mtask)) + goto iscsi_iser_data_xmit_fail; + + if (conn->mtask->hdr->itt == + cpu_to_be32(ISCSI_RESERVED_TAG)) { + spin_lock_bh(&session->lock); + __kfifo_put(session->mgmtpool.queue, + (void*)&conn->mtask, + sizeof(void*)); + spin_unlock_bh(&session->lock); + } + } + /* done with this mtask */ + conn->mtask = NULL; + } + + /* process command queue */ + while (__kfifo_get(conn->xmitqueue, (void*)&conn->ctask, + sizeof(void*))) { + if (iscsi_iser_ctask_xmit(conn, conn->ctask)) + goto iscsi_iser_data_xmit_fail; + } + /* done with this ctask */ + conn->ctask = NULL; + + /* process the rest control plane PDUs, if any */ + if (unlikely(__kfifo_len(conn->mgmtqueue))) { + struct iscsi_iser_session 
*session = conn->session; + + while (__kfifo_get(conn->mgmtqueue, (void*)&conn->mtask, + sizeof(void*))) { + if (iscsi_iser_mtask_xmit(conn, conn->mtask)) + goto iscsi_iser_data_xmit_fail; + + if (conn->mtask->hdr->itt == + cpu_to_be32(ISCSI_RESERVED_TAG)) { + spin_lock_bh(&session->lock); + __kfifo_put(session->mgmtpool.queue, + (void*)&conn->mtask, + sizeof(void*)); + spin_unlock_bh(&session->lock); + } + } + /* done with this mtask */ + conn->mtask = NULL; + } + + return 0; + +iscsi_iser_data_xmit_fail: + if (unlikely(conn->suspend_tx)) + return 0; + + return -EAGAIN; +} + +static void +iscsi_iser_xmitworker(void *data) +{ + struct iscsi_iser_conn *conn = data; + + /* + * serialize Xmit worker on a per-connection basis. + */ + mutex_lock(&conn->xmitmutex); + if (iscsi_iser_data_xmit(conn)) + schedule_work(&conn->xmitwork); + mutex_unlock(&conn->xmitmutex); +} + + + +#define FAILURE_BAD_HOST 1 +#define FAILURE_SESSION_FAILED 2 +#define FAILURE_SESSION_FREED 3 +#define FAILURE_WINDOW_CLOSED 4 +#define FAILURE_SESSION_TERMINATE 5 + +static int +iscsi_iser_queuecommand(struct scsi_cmnd *sc, + void (*done)(struct scsi_cmnd *)) +{ + struct Scsi_Host *host; + int reason = 0; + struct iscsi_iser_session *session; + struct iscsi_iser_conn *conn = NULL; + struct iscsi_iser_cmd_task *ctask = NULL; + + sc->scsi_done = done; + sc->result = 0; + + host = sc->device->host; + session = iscsi_hostdata(host->hostdata); + BUG_ON(host != session->host); + + spin_lock(&session->lock); + + if (session->state != ISCSI_STATE_LOGGED_IN) { + if (session->state == ISCSI_STATE_FAILED) { + reason = FAILURE_SESSION_FAILED; + debug_scsi("rejecting because session->state = %d\n", + session->state); + goto reject; + } else if (session->state == ISCSI_STATE_TERMINATE) { + reason = FAILURE_SESSION_TERMINATE; + goto fault; + } + reason = FAILURE_SESSION_FREED; + goto fault; + } + + /* + * Check for iSCSI window and take care of CmdSN wrap-around + */ + if ((int)(session->max_cmdsn - 
session->cmdsn) < 0) { + reason = FAILURE_WINDOW_CLOSED; + debug_scsi("rejecting because session->max_cmdsn = %d " + "& session->cmdsn = %d\n", + session->max_cmdsn, + session->cmdsn); + goto reject; + } + + conn = session->leadconn; + + __kfifo_get(session->cmdpool.queue, (void*)&ctask, sizeof(void*)); + BUG_ON(ctask->sc); + + sc->SCp.phase = session->age; + sc->SCp.ptr = (char*)ctask; + iscsi_iser_cmd_init(conn, ctask, sc); + + __kfifo_put(conn->xmitqueue, (void*)&ctask, sizeof(void*)); + debug_scsi( + "ctask enq [%s cid %d sc %lx itt 0x%x " + "len %d cmdsn %d win %d]\n", + sc->sc_data_direction == DMA_TO_DEVICE ? "write" : "read", + conn->id, (long)sc, ctask->itt, sc->request_bufflen, + session->cmdsn, session->max_cmdsn - session->exp_cmdsn + 1); + spin_unlock(&session->lock); + + if (!in_interrupt() && mutex_trylock(&conn->xmitmutex)) { + spin_unlock_irq(host->host_lock); + if (iscsi_iser_data_xmit(conn)) + schedule_work(&conn->xmitwork); + mutex_unlock(&conn->xmitmutex); + spin_lock_irq(host->host_lock); + } else + schedule_work(&conn->xmitwork); + + return 0; + +reject: + spin_unlock(&session->lock); + debug_scsi("cmd 0x%x rejected (%d)\n", sc->cmnd[0], reason); + return SCSI_MLQUEUE_HOST_BUSY; + +fault: + spin_unlock(&session->lock); + printk(KERN_ERR "iscsi_iser: cmd 0x%x is not queued (%d)\n", + sc->cmnd[0], reason); + sc->sense_buffer[0] = 0x70; + sc->sense_buffer[2] = NOT_READY; + sc->sense_buffer[7] = 0x6; + sc->sense_buffer[12] = 0x08; + sc->sense_buffer[13] = 0x00; + sc->result = (DID_NO_CONNECT << 16); + sc->resid = sc->request_bufflen; + sc->scsi_done(sc); + return 0; +} + +static int +iscsi_iser_pool_init(struct iscsi_iser_queue *q, int max, + void ***items, int item_size) +{ + int i; + + *items = kmalloc(max * sizeof(void*), GFP_KERNEL); + if (*items == NULL) + return -ENOMEM; + + q->max = max; + q->pool = kmalloc(max * sizeof(void*), GFP_KERNEL); + if (q->pool == NULL) { + kfree(*items); + return -ENOMEM; + } + + q->queue = 
kfifo_init((void*)q->pool, max * sizeof(void*), + GFP_KERNEL, NULL); + if (q->queue == ERR_PTR(-ENOMEM)) { + kfree(q->pool); + kfree(*items); + return -ENOMEM; + } + + for (i = 0; i < max; i++) { + q->pool[i] = kmalloc(item_size, GFP_KERNEL); + if (q->pool[i] == NULL) { + int j; + for (j = 0; j < i; j++) { + kfree(q->pool[j]); + } + kfifo_free(q->queue); + kfree(q->pool); + kfree(*items); + return -ENOMEM; + } + memset(q->pool[i], 0, item_size); + (*items)[i] = q->pool[i]; + __kfifo_put(q->queue, (void*)&q->pool[i], sizeof(void*)); + } + return 0; +} + +static void +iscsi_iser_pool_free(struct iscsi_iser_queue *q, void **items) +{ + int i; + + for (i = 0; i < q->max; i++) + kfree(items[i]); + kfree(q->pool); + kfree(items); +} + +static iscsi_connh_t +iscsi_iser_conn_create(iscsi_sessionh_t sessionh, + uint32_t conn_idx) +{ + struct iscsi_iser_session *session = iscsi_ptr(sessionh); + struct iscsi_iser_conn *conn = NULL; + + conn = kzalloc(sizeof *conn, GFP_KERNEL); + if (conn == NULL) { + goto conn_alloc_fail; + } + + /* Init the connection */ + conn->c_stage = ISCSI_CONN_INITIAL_STAGE; + + conn->tmabort_state = TMABORT_INITIAL; + + conn->session = session; + conn->id = conn_idx; + + conn->exp_statsn = 0; + + /* initialize general xmit PDU commands queue */ + conn->xmitqueue = kfifo_alloc(session->cmds_max * sizeof(void*), + GFP_KERNEL, NULL); + if (conn->xmitqueue == ERR_PTR(-ENOMEM)) + goto xmitqueue_alloc_fail; + + /* initialize general immediate & non-immediate PDU commands queue */ + conn->immqueue = kfifo_alloc(session->mgmtpool_max * sizeof(void*), + GFP_KERNEL, NULL); + if (conn->immqueue == ERR_PTR(-ENOMEM)) + goto immqueue_alloc_fail; + + conn->mgmtqueue = kfifo_alloc(session->mgmtpool_max * sizeof(void*), + GFP_KERNEL, NULL); + if (conn->mgmtqueue == ERR_PTR(-ENOMEM)) + goto mgmtqueue_alloc_fail; + + INIT_WORK(&conn->xmitwork, iscsi_iser_xmitworker, conn); + + /* allocate login_mtask used for the login/text sequences */ + spin_lock_bh(&session->lock); + 
if (!__kfifo_get(session->mgmtpool.queue, + (void*)&conn->login_mtask, + sizeof(void*))) { + spin_unlock_bh(&session->lock); + goto login_mtask_alloc_fail; + } + spin_unlock_bh(&session->lock); + + init_timer(&conn->tmabort_timer); + mutex_init(&conn->xmitmutex); + init_waitqueue_head(&conn->ehwait); + spin_lock_init(&conn->lock); + + return iscsi_handle(conn); + +login_mtask_alloc_fail: + kfifo_free(conn->mgmtqueue); +mgmtqueue_alloc_fail: + kfifo_free(conn->immqueue); +immqueue_alloc_fail: + kfifo_free(conn->xmitqueue); +xmitqueue_alloc_fail: + kfree(conn); +conn_alloc_fail: + return iscsi_handle(NULL); +} + +static void +iscsi_iser_conn_destroy(iscsi_connh_t connh) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + unsigned long flags; + + mutex_lock(&conn->xmitmutex); + set_bit(SUSPEND_BIT, &conn->suspend_tx); + + if (conn->c_stage == ISCSI_CONN_INITIAL_STAGE && conn->sock) { + sock_release(conn->sock); + conn->sock = NULL; + } + + spin_lock_bh(&session->lock); + conn->c_stage = ISCSI_CONN_CLEANUP_WAIT; + if (session->leadconn == conn) { + /* + * leading connection? then give up on recovery. + */ + session->state = ISCSI_STATE_TERMINATE; + wake_up(&conn->ehwait); + } + spin_unlock_bh(&session->lock); + + mutex_unlock(&conn->xmitmutex); + + /* + * Block until all in-progress commands for this connection + * time out or fail. 
+ */ + for (;;) { + spin_lock_irqsave(session->host->host_lock, flags); + if (!session->host->host_busy) { /* OK for ERL == 0 */ + spin_unlock_irqrestore(session->host->host_lock, flags); + debug_iser("%s: released host_lock (host's not busy)\n", __FUNCTION__); + break; + } + spin_unlock_irqrestore(session->host->host_lock, flags); + msleep_interruptible(500); + debug_iser("conn_destroy(): host = 0x%p, host_busy %d host_failed %d\n", session->host, + session->host->host_busy, session->host->host_failed); + /* + * force eh_abort() to unblock + */ + wake_up(&conn->ehwait); + } + + spin_lock_bh(&session->lock); + __kfifo_put(session->mgmtpool.queue, (void*)&conn->login_mtask, + sizeof(void*)); + list_del(&conn->item); + if (list_empty(&session->connections)) + session->leadconn = NULL; + if (session->leadconn && session->leadconn == conn) + session->leadconn = container_of(session->connections.next, + struct iscsi_iser_conn, item); + + if (session->leadconn == NULL) + /* none connections exits.. 
reset sequencing */ + session->cmdsn = session->max_cmdsn = session->exp_cmdsn = 1; + spin_unlock_bh(&session->lock); + + kfifo_free(conn->xmitqueue); + kfifo_free(conn->immqueue); + kfifo_free(conn->mgmtqueue); + kfree(conn); +} + +static int +iscsi_iser_conn_bind(iscsi_sessionh_t sessionh, + iscsi_connh_t connh, uint32_t transport_fd, + int is_leading) +{ + struct iscsi_iser_session *session = iscsi_ptr(sessionh); + struct iscsi_iser_conn *tmp = ERR_PTR(-EEXIST), *conn = iscsi_ptr(connh); + struct socket *sock; + struct iser_conn *p_iser_conn; + int error = 0; + + /* lookup for existing socket */ + sock = sockfd_lookup(transport_fd, &error); + if (!sock) { + printk(KERN_ERR "iscsi_iser: sockfd_lookup failed %d\n", + error); + return -EEXIST; + } + + /* lookup for existing connection */ + spin_lock_bh(&session->lock); + list_for_each_entry(tmp, &session->connections, item) { + if (tmp == conn) { + if (conn->c_stage != ISCSI_CONN_STOPPED || + conn->stop_stage == STOP_CONN_TERM) { + printk(KERN_ERR "iscsi_iser: can't bind " + "non-stopped connection (%d:%d)\n", + conn->c_stage, conn->stop_stage); + spin_unlock_bh(&session->lock); + return -EIO; + } + break; + } + } + if (tmp != conn) { + /* bind new iSCSI connection to session */ + conn->session = session; + + list_add(&conn->item, &session->connections); + } + spin_unlock_bh(&session->lock); + + if (conn->stop_stage != STOP_CONN_SUSPEND) { + /* bind iSCSI connection and socket */ + conn->sock = sock; + } + + /* binds the iSER connection retrieved from the previously connected * + * socket to the iSCSI layer connection. 
exchanges connection pointers */ + p_iser_conn = iser_conn_from_sock(sock); + p_iser_conn->p_iscsi_conn = conn; + conn->ib_conn = p_iser_conn; + + if (is_leading) + session->leadconn = conn; + + clear_bit(SUSPEND_BIT, &conn->suspend_tx); + + return 0; +} + +static int +iscsi_iser_conn_start(iscsi_connh_t connh) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + int error = 0; + + if (session == NULL) { + printk(KERN_ERR "iscsi_iser: can't start unbound connection\n"); + return -EPERM; + } + + spin_lock_bh(&session->lock); + conn->c_stage = ISCSI_CONN_STARTED; + session->state = ISCSI_STATE_LOGGED_IN; + + switch(conn->stop_stage) { + case STOP_CONN_RECOVER: + /* + * unblock eh_abort() if it is blocked. re-try all + * commands after successful recovery + */ + session->conn_cnt++; + conn->stop_stage = 0; + conn->tmabort_state = TMABORT_INITIAL; + session->age++; + wake_up(&conn->ehwait); + break; + case STOP_CONN_TERM: + session->conn_cnt++; + conn->stop_stage = 0; + break; + case STOP_CONN_SUSPEND: + conn->stop_stage = 0; + clear_bit(SUSPEND_BIT, &conn->suspend_tx); + break; + default: + break; + } + spin_unlock_bh(&session->lock); + + error = iser_conn_set_full_featured_mode(conn); + + return error; +} + +static void +iscsi_iser_conn_stop(iscsi_connh_t connh, int flag) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + struct iscsi_iser_cmd_task *ctask; + struct iscsi_iser_mgmt_task *mtask; + unsigned long flags; + + BUG_ON(!conn->sock); + + mutex_lock(&conn->xmitmutex); + + spin_lock_irqsave(session->host->host_lock, flags); + spin_lock(&session->lock); + conn->stop_stage = flag; + conn->c_stage = ISCSI_CONN_STOPPED; + set_bit(SUSPEND_BIT, &conn->suspend_tx); + + if (flag != STOP_CONN_SUSPEND) + session->conn_cnt--; + + if (session->conn_cnt == 0 || session->leadconn == conn) + session->state = ISCSI_STATE_FAILED; + + spin_unlock(&session->lock); + 
spin_unlock_irqrestore(session->host->host_lock, flags); + + if (flag == STOP_CONN_TERM || flag == STOP_CONN_RECOVER) { + /* + * flush xmit queues. + */ + spin_lock_bh(&session->lock); + while (__kfifo_get(conn->xmitqueue, (void*)&ctask, + sizeof(void*))) { + spin_unlock_bh(&session->lock); + local_bh_disable(); + iscsi_iser_ctask_cleanup(conn, ctask); + local_bh_enable(); + spin_lock_bh(&session->lock); + } + conn->ctask = NULL; + + while (__kfifo_get(conn->immqueue, (void*)&mtask, + sizeof(void*)) || + __kfifo_get(conn->mgmtqueue, (void*)&mtask, + sizeof(void*))) { + __kfifo_put(session->mgmtpool.queue,(void*)&mtask, + sizeof(void*)); + } + conn->mtask = NULL; + spin_unlock_bh(&session->lock); + + /* + * release conn only after we stopped data_xmit() + * activity and flushed all outstandings + */ + + /* starts conn teardown process, waits until all previously * + * posted buffers get flushed, deallocates all conn resources */ + iser_conn_terminate(conn->ib_conn); + + sock_release(conn->sock); + conn->sock = NULL; + } + mutex_unlock(&conn->xmitmutex); +} + + +static int +iscsi_iser_conn_send_generic(iscsi_connh_t connh, struct iscsi_hdr *hdr, + char *data, uint32_t data_size) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + struct iscsi_iser_mgmt_task *mtask = NULL; + struct iscsi_nopout *nop = (struct iscsi_nopout *)hdr; + + spin_lock_bh(&session->lock); + if (session->state == ISCSI_STATE_TERMINATE) { + spin_unlock_bh(&session->lock); + return -EPERM; + } + if (hdr->opcode == (ISCSI_OP_LOGIN | ISCSI_OP_IMMEDIATE) || + hdr->opcode == (ISCSI_OP_TEXT | ISCSI_OP_IMMEDIATE)) { + /* + * Login and Text are sent serially, in + * request-followed-by-response sequence. + * Same mtask can be used. Same ITT must be used. + * Note that login_mtask is preallocated at cnx_create(). 
+ */ + mtask = conn->login_mtask; + } else { + BUG_ON(conn->c_stage == ISCSI_CONN_INITIAL_STAGE); + BUG_ON(conn->c_stage == ISCSI_CONN_STOPPED); + if (!__kfifo_get(session->mgmtpool.queue, + (void*)&mtask, sizeof(void*))) { + spin_unlock_bh(&session->lock); + return -ENOSPC; + } + } + + /* + * pre-format CmdSN and ExpStatSN for outgoing PDU. + */ + if (hdr->itt != cpu_to_be32(ISCSI_RESERVED_TAG)) { + hdr->itt = mtask->itt | (conn->id << CID_SHIFT) | + (session->age << AGE_SHIFT); + nop->cmdsn = cpu_to_be32(session->cmdsn); + if (conn->c_stage == ISCSI_CONN_STARTED && + !(hdr->opcode & ISCSI_OP_IMMEDIATE)) + session->cmdsn++; + } else + /* do not advance CmdSN */ + nop->cmdsn = cpu_to_be32(session->cmdsn); + + nop->exp_statsn = cpu_to_be32(conn->exp_statsn); + + memcpy(mtask->hdr, hdr, sizeof(struct iscsi_hdr)); + + spin_unlock_bh(&session->lock); + + if (data_size) { + memcpy(mtask->data, data, data_size); + mtask->data_count = data_size; + } else + mtask->data_count = 0; + + debug_scsi("mgmtpdu [op 0x%x hdr->itt 0x%x datalen %d]\n", + hdr->opcode, hdr->itt, data_size); + + /* + * since send_pdu() could be called at least from two contexts, + * we need to serialize __kfifo_put, so we don't have to take + * additional lock on fast data-path + */ + if (hdr->opcode & ISCSI_OP_IMMEDIATE) + __kfifo_put(conn->immqueue, (void*)&mtask, sizeof(void*)); + else + __kfifo_put(conn->mgmtqueue, (void*)&mtask, sizeof(void*)); + + schedule_work(&conn->xmitwork); + + return 0; +} + +static int +iscsi_iser_eh_host_reset(struct scsi_cmnd *sc) +{ + struct iscsi_iser_cmd_task *ctask = (struct iscsi_iser_cmd_task *)sc->SCp.ptr; + struct iscsi_iser_conn *conn = ctask->conn; + struct iscsi_iser_session *session = conn->session; + + debug_iser("%s: host_busy %d host_failed %d\n", + __FUNCTION__, + session->host->host_busy, session->host->host_failed); + spin_lock_bh(&session->lock); + if (session->state == ISCSI_STATE_TERMINATE) { + debug_scsi("failing host reset: session terminated " + 
"[CID %d age %d]", conn->id, session->age); + spin_unlock_bh(&session->lock); + return FAILED; + } + spin_unlock_bh(&session->lock); + + debug_scsi("failing connection CID %d due to SCSI host reset " + "[itt 0x%x age %d]", conn->id, ctask->itt, + session->age); + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + + return SUCCESS; +} + +static void +iscsi_iser_tmabort_timedout(unsigned long data) +{ + struct iscsi_iser_cmd_task *ctask = (struct iscsi_iser_cmd_task *)data; + struct iscsi_iser_conn *conn = ctask->conn; + struct iscsi_iser_session *session = conn->session; + + spin_lock(&session->lock); + if (conn->tmabort_state == TMABORT_INITIAL) { + __kfifo_put(session->mgmtpool.queue, + (void*)&ctask->mtask, sizeof(void*)); + conn->tmabort_state = TMABORT_TIMEDOUT; + debug_scsi("tmabort timedout [sc %lx itt 0x%x]\n", + (long)ctask->sc, ctask->itt); + /* unblock eh_abort() */ + wake_up(&conn->ehwait); + } + spin_unlock(&session->lock); +} + +static int +iscsi_iser_eh_abort(struct scsi_cmnd *sc) +{ + int rc; + struct iscsi_iser_cmd_task *ctask; + struct iscsi_iser_conn *conn; + struct iscsi_iser_session *session; + + ctask = (struct iscsi_iser_cmd_task *)sc->SCp.ptr; + conn = ctask->conn; + session = conn->session; + + debug_iser("%s: host_busy %d host_failed %d\n", + __FUNCTION__, + session->host->host_busy, session->host->host_failed); + + debug_scsi("aborting [sc %lx itt 0x%x]\n", (long)sc, ctask->itt); + + /* + * two cases for ERL=0 here: + * + * 1) connection-level failure; + * 2) recovery due protocol error; + */ + mutex_lock(&conn->xmitmutex); + spin_lock_bh(&session->lock); + debug_iser("%s: session->state = %d\n", __FUNCTION__, session->state); + if (session->state != ISCSI_STATE_LOGGED_IN) { + if (session->state == ISCSI_STATE_TERMINATE) { + spin_unlock_bh(&session->lock); + mutex_unlock(&conn->xmitmutex); + debug_scsi("abort failed becuase session->state == ISCSI_STATE_TERMINATE\n"); + goto failed; + } + spin_unlock_bh(&session->lock); + } else { + 
struct iscsi_tm *hdr = &conn->tmhdr; + + /* + * Still LOGGED_IN... + */ + + if (!ctask->sc || sc->SCp.phase != session->age) { + /* + * 1) ctask completed before time out. But session + * is still ok => Happy Retry. + * 2) session was re-open during time out of ctask. + */ + spin_unlock_bh(&session->lock); + mutex_unlock(&conn->xmitmutex); + goto success; + } + conn->tmabort_state = TMABORT_INITIAL; + spin_unlock_bh(&session->lock); + + /* + * ctask timed out but session is OK + * ERL=0 requires task mgmt abort to be issued on each + * failed command. requests must be serialized. + */ + memset(hdr, 0, sizeof(struct iscsi_tm)); + hdr->opcode = ISCSI_OP_SCSI_TMFUNC | ISCSI_OP_IMMEDIATE; + hdr->flags = ISCSI_TM_FUNC_ABORT_TASK; + hdr->flags |= ISCSI_FLAG_CMD_FINAL; + memcpy(hdr->lun, ctask->hdr->lun, sizeof(hdr->lun)); + hdr->rtt = ctask->hdr->itt; + hdr->refcmdsn = ctask->hdr->cmdsn; + + iser_err("op 0x%x aborting rtt 0x%x itt 0x%x dlength %d]\n", + hdr->opcode, hdr->rtt, hdr->itt, ntoh24(hdr->dlength)); + + rc = iscsi_iser_conn_send_generic(iscsi_handle(conn), (struct iscsi_hdr *)hdr, + NULL, 0); + + if (rc) { + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + debug_scsi("abort sent failure [itt 0x%x]", ctask->itt); + } else { + /* + * TMF abort vs. 
TMF response race logic + */ + spin_lock_bh(&session->lock); + ctask->mtask = (struct iscsi_iser_mgmt_task *) + session->mgmt_cmds[(hdr->itt & ITT_MASK) - + ISCSI_MGMT_ITT_OFFSET]; + if (conn->tmabort_state == TMABORT_INITIAL) { + conn->tmabort_timer.expires = 3*HZ + jiffies; + conn->tmabort_timer.function = + iscsi_iser_tmabort_timedout; + conn->tmabort_timer.data = (unsigned long)ctask; + add_timer(&conn->tmabort_timer); + debug_scsi("abort sent [itt 0x%x]\n", ctask->itt); + } else { + if (!ctask->sc || + conn->tmabort_state == TMABORT_SUCCESS) { + conn->tmabort_state = TMABORT_INITIAL; + spin_unlock_bh(&session->lock); + mutex_unlock(&conn->xmitmutex); + goto success; + } + conn->tmabort_state = TMABORT_INITIAL; + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + } + spin_unlock_bh(&session->lock); + } + } + mutex_unlock(&conn->xmitmutex); + + /* + * block eh thread until: + * + * 1) abort response; + * 2) abort timeout; + * 3) session re-opened; + * 4) session terminated; + */ + for (;;) { + int p_state = session->state; + + rc = wait_event_interruptible(conn->ehwait, + (p_state == ISCSI_STATE_LOGGED_IN ? + (session->state == ISCSI_STATE_TERMINATE || + conn->tmabort_state != TMABORT_INITIAL) : + (session->state == ISCSI_STATE_TERMINATE || + session->state == ISCSI_STATE_LOGGED_IN))); + if (rc) { + /* shutdown.. */ + session->state = ISCSI_STATE_TERMINATE; + debug_scsi("abort failed: session->state = %d\n", + session->state); + goto failed; + } + + if (signal_pending(current)) + flush_signals(current); + + + if (session->state == ISCSI_STATE_TERMINATE){ + debug_scsi("abort failed because session->state == ISCSI_STATE_TERMINATE (2)\n"); + goto failed; + } + + spin_lock_bh(&session->lock); + if (sc->SCp.phase == session->age && + (conn->tmabort_state == TMABORT_TIMEDOUT || + conn->tmabort_state == TMABORT_FAILED)) { + conn->tmabort_state = TMABORT_INITIAL; + if (!ctask->sc) { + /* + * ctask completed before tmf abort response or + * time out. 
+ * But session is still ok => Happy Retry. + */ + spin_unlock_bh(&session->lock); + break; + } + spin_unlock_bh(&session->lock); + iscsi_iser_conn_failure(conn, ISCSI_ERR_CONN_FAILED); + continue; + } + spin_unlock_bh(&session->lock); + break; + } + +success: + debug_scsi("abort success [sc %lx itt 0x%x]\n", (long)sc, ctask->itt); + rc = SUCCESS; + goto exit; + +failed: + debug_scsi("abort failed [sc %lx itt 0x%x]\n", (long)sc, ctask->itt); + rc = FAILED; + +exit: + del_timer_sync(&conn->tmabort_timer); + + mutex_lock(&conn->xmitmutex); + if (conn->sock) { + struct sock *sk = conn->sock->sk; + + write_lock_bh(&sk->sk_callback_lock); + iscsi_iser_ctask_cleanup(conn, ctask); + write_unlock_bh(&sk->sk_callback_lock); + } + mutex_unlock(&conn->xmitmutex); + return rc; +} + + +static struct scsi_host_template iscsi_iser_sht = { + .name = "iSCSI Initiator over iSER, v." + ISCSI_VERSION_STR, + .queuecommand = iscsi_iser_queuecommand, + .can_queue = ISCSI_ISER_XMIT_CMDS_MAX - 1, + .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, + .cmd_per_lun = ISCSI_ISER_CMD_PER_LUN, + .eh_abort_handler = iscsi_iser_eh_abort, + .eh_host_reset_handler = iscsi_iser_eh_host_reset, + .use_clustering = DISABLE_CLUSTERING, + .proc_name = "iscsi_iser", + .this_id = -1, +}; + +static iscsi_sessionh_t +iscsi_iser_session_create(uint32_t initial_cmdsn, + struct Scsi_Host *host) +{ + struct iscsi_iser_session *session = NULL; + int cmd_i, mgmt_i, j; + + session = iscsi_hostdata(host->hostdata); + memset(session, 0, sizeof(struct iscsi_iser_session)); + + session->host = host; + session->id = host->host_no; + session->mgmtpool_max = ISCSI_ISER_MGMT_CMDS_MAX; + session->cmds_max = ISCSI_ISER_XMIT_CMDS_MAX; + session->cmdsn = initial_cmdsn; + session->exp_cmdsn = initial_cmdsn + 1; + session->max_cmdsn = initial_cmdsn + 1; + + if (iscsi_iser_pool_init(&session->cmdpool, session->cmds_max, + (void***)&session->cmds, + sizeof(struct iscsi_iser_cmd_task))) + goto cmdpool_alloc_fail; + + /* pre-format cmds pool 
with ITT */ + for (cmd_i = 0; cmd_i < session->cmds_max; cmd_i++) { + session->cmds[cmd_i]->itt = cmd_i; + + session->cmds[cmd_i]->hdr = (struct iscsi_cmd *) + &session->cmds[cmd_i]->desc.iscsi_header; + } + + spin_lock_init(&session->lock); + INIT_LIST_HEAD(&session->connections); + + /* initialize immediate command pool */ + if (iscsi_iser_pool_init(&session->mgmtpool, session->mgmtpool_max, + (void***)&session->mgmt_cmds, + sizeof(struct iscsi_iser_mgmt_task))) + goto mgmtpool_alloc_fail; + + /* pre-format immediate cmds pool with ITT */ + for (mgmt_i = 0; mgmt_i < session->mgmtpool_max; mgmt_i++) { + session->mgmt_cmds[mgmt_i]->itt = ISCSI_MGMT_ITT_OFFSET + mgmt_i; + + session->mgmt_cmds[mgmt_i]->hdr = + &session->mgmt_cmds[mgmt_i]->desc.iscsi_header; + + session->mgmt_cmds[mgmt_i]->desc.data = + kmalloc(DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH, + GFP_KERNEL); + + if (!session->mgmt_cmds[mgmt_i]->desc.data) + goto immdata_alloc_fail; + + session->mgmt_cmds[mgmt_i]->data = + session->mgmt_cmds[mgmt_i]->desc.data; + } + + return iscsi_handle(session); + +immdata_alloc_fail: + for (j = 0; j < mgmt_i; j++) + kfree(session->mgmt_cmds[j]->desc.data); + iscsi_iser_pool_free(&session->mgmtpool, (void**)session->mgmt_cmds); +mgmtpool_alloc_fail: + iscsi_iser_pool_free(&session->cmdpool, (void**)session->cmds); +cmdpool_alloc_fail: + return iscsi_handle(NULL); +} + +static void +iscsi_iser_session_destroy(iscsi_sessionh_t sessionh) +{ + int mgmt_i; + struct iscsi_iser_session *session = iscsi_ptr(sessionh); + + for (mgmt_i = 0; mgmt_i < session->mgmtpool_max; mgmt_i++) + kfree(session->mgmt_cmds[mgmt_i]->desc.data); + + iscsi_iser_pool_free(&session->mgmtpool, (void**)session->mgmt_cmds); + iscsi_iser_pool_free(&session->cmdpool, (void**)session->cmds); +} + +static int +iscsi_iser_conn_set_param(iscsi_connh_t connh, + enum iscsi_param param, + uint32_t value) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + + 
spin_lock_bh(&session->lock); + if (conn->c_stage != ISCSI_CONN_INITIAL_STAGE && + conn->stop_stage != STOP_CONN_RECOVER) { + printk(KERN_ERR "iscsi_iser: can not change parameter [%d]\n", + param); + spin_unlock_bh(&session->lock); + return 0; + } + spin_unlock_bh(&session->lock); + + switch (param) { + case ISCSI_PARAM_MAX_RECV_DLENGTH: + /* TBD */ + break; + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + conn->max_xmit_dlength = value; + break; + case ISCSI_PARAM_HDRDGST_EN: + if (value) { + printk(KERN_ERR "HeaderDigest wasn't negotiated to None"); + return -EPROTO; + } + break; + case ISCSI_PARAM_DATADGST_EN: + if (value) { + printk(KERN_ERR "DataDigest wasn't negotiated to None"); + return -EPROTO; + } + break; + case ISCSI_PARAM_INITIAL_R2T_EN: + session->initial_r2t_en = value; + break; + case ISCSI_PARAM_IMM_DATA_EN: + session->imm_data_en = value; + break; + case ISCSI_PARAM_FIRST_BURST: + session->first_burst = value; + break; + case ISCSI_PARAM_MAX_BURST: + session->max_burst = value; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + session->pdu_inorder_en = value; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + session->dataseq_inorder_en = value; + break; + case ISCSI_PARAM_ERL: + session->erl = value; + break; + case ISCSI_PARAM_IFMARKER_EN: + if (value) { + printk(KERN_ERR "IFMarker wasn't negotiated to No"); + return -EPROTO; + } + break; + case ISCSI_PARAM_OFMARKER_EN: + if (value) { + printk(KERN_ERR "OFMarker wasn't negotiated to No"); + return -EPROTO; + } + break; + case ISCSI_PARAM_RDMAEXTENSIONS: + if (!value) { + printk(KERN_ERR "RDMAExtensions wasn't negotiated to Yes"); + return -EPROTO; + } + break; + default: + break; + } + + return 0; +} + +static int +iscsi_iser_conn_get_param(iscsi_connh_t connh, + enum iscsi_param param, + uint32_t *value) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + struct iscsi_iser_session *session = conn->session; + + switch (param) { + case ISCSI_PARAM_MAX_XMIT_DLENGTH: + *value = conn->max_xmit_dlength; + break; 
+ case ISCSI_PARAM_HDRDGST_EN: + *value = 0; + break; + case ISCSI_PARAM_DATADGST_EN: + *value = 0; + break; + case ISCSI_PARAM_INITIAL_R2T_EN: + *value = session->initial_r2t_en; + break; + case ISCSI_PARAM_MAX_R2T: + *value = session->max_r2t; + break; + case ISCSI_PARAM_IMM_DATA_EN: + *value = session->imm_data_en; + break; + case ISCSI_PARAM_FIRST_BURST: + *value = session->first_burst; + break; + case ISCSI_PARAM_MAX_BURST: + *value = session->max_burst; + break; + case ISCSI_PARAM_PDU_INORDER_EN: + *value = session->pdu_inorder_en; + break; + case ISCSI_PARAM_DATASEQ_INORDER_EN: + *value = session->dataseq_inorder_en; + break; + case ISCSI_PARAM_ERL: + *value = session->erl; + break; + case ISCSI_PARAM_IFMARKER_EN: + *value = 0; + break; + case ISCSI_PARAM_OFMARKER_EN: + *value = 0; + break; + case ISCSI_PARAM_RDMAEXTENSIONS: + *value = 1; + break; + /*case ISCSI_PARAM_TARGET_RECV_DLENGTH: + *value = conn->target_recv_dlength; + break; + case ISCSI_PARAM_INITIATOR_RECV_DLENGTH: + *value = conn->initiator_recv_dlength; + break;*/ + default: + return ISCSI_ERR_PARAM_NOT_FOUND; + } + + return 0; +} + +static int +iscsi_iser_conn_send_pdu(iscsi_connh_t connh, struct iscsi_hdr *hdr, char *data, + uint32_t data_size) +{ + struct iscsi_iser_conn *conn = iscsi_ptr(connh); + int rc; + + mutex_lock(&conn->xmitmutex); + rc = iscsi_iser_conn_send_generic(connh, hdr, data, data_size); + mutex_unlock(&conn->xmitmutex); + + return rc; +} + +static struct iscsi_transport iscsi_iser_transport = { + .owner = THIS_MODULE, + .name = "iser", + .caps = CAP_RECOVERY_L0 | CAP_MULTI_R2T, + .af = AF_ISER, + .rdma = 1, + .host_template = &iscsi_iser_sht, + .hostdata_size = sizeof(struct iscsi_iser_session), + .max_lun = ISCSI_ISER_MAX_LUN, + .max_cmd_len = ISCSI_ISER_MAX_CMD_LEN, + .create_session = iscsi_iser_session_create, + .destroy_session = iscsi_iser_session_destroy, + .create_conn = iscsi_iser_conn_create, + .bind_conn = iscsi_iser_conn_bind, + .destroy_conn = 
iscsi_iser_conn_destroy, + .set_param = iscsi_iser_conn_set_param, + .get_param = iscsi_iser_conn_get_param, + .start_conn = iscsi_iser_conn_start, + .stop_conn = iscsi_iser_conn_stop, + .send_pdu = iscsi_iser_conn_send_pdu, +}; + +static int __init iser_init(void) +{ + int err; + + iser_dbg("Starting iSER datamover...\n"); + + if (iscsi_max_lun < 1) { + printk(KERN_ERR "Invalid max_lun value of %u\n", iscsi_max_lun); + return -EINVAL; + } + + iscsi_iser_transport.max_lun = iscsi_max_lun; + + memset(&ig, 0, sizeof(struct iser_global)); + + ig.desc_cache = kmem_cache_create("iser_descriptors", + sizeof (struct iser_desc), + 0, SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (ig.desc_cache == NULL) + return -ENOMEM; + + /* adaptor init is called only after the first addr resolution */ + mutex_init(&ig.adaptor_list_mutex); + INIT_LIST_HEAD(&ig.adaptor_list); + + err = iser_register_sockets(); + if (err) { + iser_err("iser socket init failed!\n"); + goto register_socket_failure; + } + + err = iscsi_register_transport(&iscsi_iser_transport); + if (err) { + iser_err("iscsi_register_transport failed\n"); + goto register_transport_failure; + } + + return 0; + +register_transport_failure: + iser_unreg_sockets(); +register_socket_failure: + kmem_cache_destroy(ig.desc_cache); + + return err; +} + +static void __exit iser_exit(void) +{ + iser_dbg("Removing iSER datamover...\n"); + iscsi_unregister_transport(&iscsi_iser_transport); + kmem_cache_destroy(ig.desc_cache); + iser_unreg_sockets(); +} + +module_init(iser_init); +module_exit(iser_exit); From ogerlitz at voltaire.com Wed Feb 22 06:32:56 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 22 Feb 2006 16:32:56 +0200 (IST) Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: Message-ID: + The main entry points to this code are iser_send_control/command/dataout for the flow coming from iscsi_iser.c, and iser_rcv_completion for handling completions towards iscsi_iser.c. --- /ulp/iser-x/iser_initiator.c 2006-02-22 
15:06:56.000000000 +0200 +++ /ulp/iser/iser_initiator.c 2006-02-22 13:48:55.000000000 +0200 @@ -1 +1,743 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_initiator.c 5459 2006-02-22 11:00:48Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include + +#include "iscsi_iser.h" + +/* Constant PDU lengths calculations */ +#define ISER_HDR_LEN sizeof (struct iser_hdr) +#define ISER_PDU_BHS_LENGTH sizeof (struct iscsi_hdr) +#define ISER_TOTAL_HEADERS_LEN (ISER_HDR_LEN + ISER_PDU_BHS_LENGTH) + +#define USE_OFFSET(offset) (offset) +#define USE_NO_OFFSET 0 +#define USE_SIZE(size) (size) +#define USE_ENTIRE_SIZE 0 + +/* iser_dto_add_regd_buff - increments the reference count for * + * the registered buffer & adds it to the DTO object */ +static void iser_dto_add_regd_buff(struct iser_dto *p_dto, + struct iser_regd_buf *p_regd_buf, + unsigned long use_offset, + unsigned long use_size) +{ + int add_idx; + + atomic_inc(&p_regd_buf->ref_count); + + add_idx = p_dto->regd_vector_len; + p_dto->regd[add_idx] = p_regd_buf; + p_dto->used_sz[add_idx] = use_size; + p_dto->offset[add_idx] = use_offset; + + p_dto->regd_vector_len++; +} + +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *p_data, + enum iser_data_dir iser_dir, + enum dma_data_direction dma_dir) +{ + struct device *dma_device; + dma_addr_t dma_addr; + int dma_nents; + + p_iser_task->dir[iser_dir] = 1; + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + if (p_data->type == ISER_BUF_TYPE_SINGLE) { + p_iser_task->data_len[iser_dir] = p_data->size; + dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size, + dma_dir); + if (dma_mapping_error(dma_addr)) { + iser_err("dma_map_single failed at %p\n", p_data->p_buf); + return -EINVAL; + } + p_data->dma_addr = dma_addr; + } else { + dma_nents = dma_map_sg(dma_device, p_data->p_buf, p_data->size, + dma_dir); + if (dma_nents == 0) { + iser_err("dma_map_sg failed!!!\n"); + return -EINVAL; + } + p_data->dma_nents = dma_nents; + p_iser_task->data_len[iser_dir] = iser_sg_size(p_data); + } + return 0; +} + +static void 
iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *p_iser_task) +{ + struct device *dma_device; + struct iser_data_buf *p_data; + + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + p_data = &p_iser_task->data[ISER_DIR_IN]; + if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) + dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, + DMA_FROM_DEVICE); + else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ + dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, + DMA_FROM_DEVICE); + + p_data = &p_iser_task->data[ISER_DIR_OUT]; + if (p_data->p_buf != NULL && p_data->type == ISER_BUF_TYPE_SCATTERLIST) + dma_unmap_sg(dma_device, p_data->p_buf, p_data->size, + DMA_TO_DEVICE); + else if (p_data->p_buf != NULL) /* p_data->type == ISER_BUF_TYPE_SINGLE */ + dma_unmap_single(dma_device, p_data->dma_addr, p_data->size, + DMA_TO_DEVICE); +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. Total data size is stored in + * p_iser_task->data_len[ISER_DIR_IN]. 
+ */ +static int iser_prepare_read_cmd(struct iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *buf_in, + unsigned int edtl) + +{ + struct iser_regd_buf *p_regd_buf; + int err; + struct iser_hdr *hdr = &p_iser_task->desc.iser_header; + + err = iser_dma_map_task_data(p_iser_task, + buf_in, + ISER_DIR_IN, + DMA_FROM_DEVICE); + if (err) + return err; + + if (edtl > p_iser_task->data_len[ISER_DIR_IN]) { + iser_err("Total data length: %ld, less than EDTL: " + "%d, in READ cmd BHS itt: %d, p_conn: 0x%p\n", + p_iser_task->data_len[ISER_DIR_IN], edtl, + p_iser_task->itt, p_iser_task->conn); + return -EINVAL; + } + + memcpy(&p_iser_task->data[ISER_DIR_IN], buf_in, + sizeof(struct iser_data_buf)); + + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_IN); + if (err) { + iser_err("Failed to set up Data-IN RDMA\n"); + return err; + } + p_regd_buf = &p_iser_task->rdma_regd[ISER_DIR_IN]; + + hdr->flags |= ISER_RSV; + hdr->read_stag = cpu_to_be32(p_regd_buf->reg.rkey); + hdr->read_va = cpu_to_be64(p_regd_buf->reg.va); + + iser_dbg("Cmd itt:%d READ tags RKEY:%#.4X VA:%#llX\n", + p_iser_task->itt, p_regd_buf->reg.rkey, + (unsigned long long)p_regd_buf->reg.va); + + return 0; +} + +/* Register user buffer memory and initialize passive rdma + * dto descriptor. Total data size is stored in + * p_iser_task->data_len[ISER_DIR_OUT]. 
+ */ +static int +iser_prepare_write_cmd(struct iscsi_iser_cmd_task *p_iser_task, + struct iser_data_buf *buf_out, + unsigned int imm_sz, + unsigned int unsol_sz, + unsigned int edtl) +{ + struct iser_regd_buf *p_regd_buf; + int err; + struct iser_dto *p_send_dto = &p_iser_task->desc.dto; + struct iser_hdr *hdr = &p_iser_task->desc.iser_header; + + err = iser_dma_map_task_data(p_iser_task, + buf_out, + ISER_DIR_OUT, + DMA_TO_DEVICE); + if (err) + return err; + + if (edtl > p_iser_task->data_len[ISER_DIR_OUT]) { + iser_err("Total data length: %ld, less than EDTL: %d, " + "in WRITE cmd BHS itt: %d, p_conn: 0x%p\n", + p_iser_task->data_len[ISER_DIR_OUT], + edtl, p_iser_task->itt, p_iser_task->conn); + return -EINVAL; + } + + memcpy(&p_iser_task->data[ISER_DIR_OUT], buf_out, + sizeof(struct iser_data_buf)); + + err = iser_reg_rdma_mem(p_iser_task,ISER_DIR_OUT); + if (err != 0) { + iser_err("Failed to register write cmd RDMA mem\n"); + return err; + } + + p_regd_buf = &p_iser_task->rdma_regd[ISER_DIR_OUT]; + + if (unsol_sz < edtl) { + hdr->flags |= ISER_WSV; + hdr->write_stag = cpu_to_be32(p_regd_buf->reg.rkey); + hdr->write_va = cpu_to_be64(p_regd_buf->reg.va + unsol_sz); + + iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X " + "VA:%#llX + unsol:%d\n", + p_iser_task->itt, p_regd_buf->reg.rkey, + (unsigned long long)p_regd_buf->reg.va, unsol_sz); + } + + if (imm_sz > 0) { + iser_dbg("Cmd itt:%d, WRITE, adding imm.data sz: %d\n", + p_iser_task->itt, imm_sz); + iser_dto_add_regd_buff(p_send_dto, + p_regd_buf, + USE_NO_OFFSET, + USE_SIZE(imm_sz)); + } + + return 0; +} + +/** + * iser_post_receive_control - allocates, initializes and posts receive DTO. 
+ */ +static int iser_post_receive_control(struct iscsi_iser_conn *p_iser_conn) +{ + struct iser_desc *rx_desc; + struct iser_regd_buf *p_regd_hdr; + struct iser_regd_buf *p_regd_data; + struct iser_dto *p_recv_dto = NULL; + struct iser_adaptor *p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; + int rx_data_size, err = 0; + + rx_desc = kmem_cache_alloc(ig.desc_cache, + GFP_KERNEL | __GFP_NOFAIL); + if (rx_desc == NULL) { + iser_err("Failed to alloc desc for post recv\n"); + err = -ENOMEM; + goto post_receive_control_exit; + } + rx_desc->type = ISCSI_RX; + + /* for the login sequence we must support rx of upto 8K */ + if (p_iser_conn->c_stage == ISCSI_CONN_INITIAL_STAGE) + rx_data_size = DEFAULT_MAX_RECV_DATA_SEGMENT_LENGTH; + else /* FIXME till user space sets conn->max_recv_dlength correctly */ + rx_data_size = 128; + + rx_desc->data = kmalloc(rx_data_size, GFP_KERNEL | __GFP_NOFAIL); + + if (rx_desc->data == NULL) { + iser_err("Failed to alloc data buf for post recv\n"); + err = -ENOMEM; + goto post_receive_control_exit; + + } + + p_recv_dto = &rx_desc->dto; + p_recv_dto->p_conn = p_iser_conn; + p_recv_dto->regd_vector_len = 0; + + p_regd_hdr = &rx_desc->hdr_regd_buf; + memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); + p_regd_hdr->p_adaptor = p_iser_adaptor; + p_regd_hdr->virt_addr = rx_desc; /* == &rx_desc->iser_header */ + p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + iser_reg_single(p_iser_adaptor, p_regd_hdr, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(p_recv_dto, p_regd_hdr, USE_NO_OFFSET, + USE_ENTIRE_SIZE); + + p_regd_data = &rx_desc->data_regd_buf; + memset(p_regd_data, 0, sizeof(struct iser_regd_buf)); + p_regd_data->p_adaptor = p_iser_adaptor; + p_regd_data->virt_addr = rx_desc->data; + p_regd_data->data_size = rx_data_size; + + iser_reg_single(p_iser_adaptor, p_regd_data, DMA_FROM_DEVICE); + + iser_dto_add_regd_buff(p_recv_dto, p_regd_data, + USE_NO_OFFSET, USE_ENTIRE_SIZE); + + err = iser_post_recv(rx_desc); + +post_receive_control_exit: 
+ if (err && rx_desc) { + iser_dto_buffs_release(p_recv_dto); + if (rx_desc->data != NULL) + kfree(rx_desc->data); + kmem_cache_free(ig.desc_cache, rx_desc); + } + return err; +} + +/* creates a new tx descriptor and adds header regd buffer */ +static void iser_create_send_desc(struct iscsi_iser_conn *p_iser_conn, + struct iser_desc *tx_desc) +{ + struct iser_regd_buf *p_regd_hdr = &tx_desc->hdr_regd_buf; + struct iser_dto *p_send_dto = &tx_desc->dto; + + memset(p_regd_hdr, 0, sizeof(struct iser_regd_buf)); + p_regd_hdr->p_adaptor = p_iser_conn->ib_conn->p_adaptor; + p_regd_hdr->virt_addr = tx_desc; /* == &tx_desc->iser_header */ + p_regd_hdr->data_size = ISER_TOTAL_HEADERS_LEN; + + p_send_dto->p_conn = p_iser_conn; + p_send_dto->notify_enable = 1; + p_send_dto->regd_vector_len = 0; + + memset(&tx_desc->iser_header, 0, ISER_HDR_LEN); + tx_desc->iser_header.flags = ISER_VER; + + iser_dto_add_regd_buff(p_send_dto, p_regd_hdr, + USE_NO_OFFSET, USE_ENTIRE_SIZE); +} + +/** + * iser_conn_set_full_featured_mode - (iSER API) + */ +int iser_conn_set_full_featured_mode(struct iscsi_iser_conn *p_iser_conn) +{ + int i, err = 0; + /* no need to keep it in a var, we are after login so if this should + * be negotiated, by now the result should be available here */ + int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS; + + iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num); + + /* Check that there is no posted recv or send buffers left - */ + /* they must be consumed during the login phase */ + if (atomic_read(&p_iser_conn->ib_conn->post_recv_buf_count) != 0) + iser_bug("Number of currently posted recv bufs non-zero\n"); + if (atomic_read(&p_iser_conn->ib_conn->post_send_buf_count) != 0) + iser_bug("Number of currently posted send bufs non-zero\n"); + + /* Initial post receive buffers */ + for (i = 0; i < initial_post_recv_bufs_num; i++) { + if (iser_post_receive_control(p_iser_conn) != 0) { + iser_err("Failed to post recv bufs at:%d conn:0x%p\n", + i, p_iser_conn); + 
err = -ENOMEM; + goto ffeatured_mode_failure; + } + } + iser_dbg("Posted %d post recv bufs, conn:0x%p\n", i, p_iser_conn); + return 0; + +ffeatured_mode_failure: + return err; +} + +static int +iser_check_xmit(struct iscsi_iser_conn *conn, void *task) +{ + int rc = 0; + + spin_lock_bh(&conn->lock); + if (atomic_read(&conn->ib_conn->post_send_buf_count) == + ISER_QP_MAX_REQ_DTOS) { + iser_dbg("%ld can't xmit task %p, suspending tx\n",jiffies,task); + set_bit(SUSPEND_BIT, &conn->suspend_tx); + rc = -EAGAIN; + } + spin_unlock_bh(&conn->lock); + return rc; +} + + +/** + * iser_send_command - send command PDU + */ +int iser_send_command(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_cmd_task *p_ctask) +{ + struct iser_dto *p_send_dto = NULL; + unsigned long edtl; + int err = 0; + struct iser_data_buf data_buf; + + struct iscsi_cmd *hdr = p_ctask->hdr; + struct scsi_cmnd *sc = p_ctask->sc; + + if (atomic_read(&p_iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", p_iser_conn->ib_conn); + return -EPERM; + } + if (iser_check_xmit(p_iser_conn, p_ctask)) + return -EAGAIN; + + edtl = ntohl(hdr->data_length); + + /* build the tx desc regd header and add it to the tx desc dto */ + p_ctask->desc.type = ISCSI_TX_SCSI_COMMAND; + p_send_dto = &p_ctask->desc.dto; + p_send_dto->p_task = p_ctask; + iser_create_send_desc(p_iser_conn, &p_ctask->desc); + + if (sc->use_sg) { /* using a scatter list */ + data_buf.p_buf = sc->request_buffer; + data_buf.size = sc->use_sg; + data_buf.type = ISER_BUF_TYPE_SCATTERLIST; + } else { /* using a single buffer */ + data_buf.p_buf = sc->request_buffer; + data_buf.size = sc->request_bufflen; + data_buf.type = ISER_BUF_TYPE_SINGLE; + } + + if (hdr->flags & ISCSI_FLAG_CMD_READ) { + err = iser_prepare_read_cmd(p_ctask, &data_buf, edtl); + if (err) + goto send_command_error; + } + if (hdr->flags & ISCSI_FLAG_CMD_WRITE) { + err = iser_prepare_write_cmd(p_ctask, &data_buf, + p_ctask->imm_count, + 
p_ctask->imm_count + + p_ctask->unsol_count, + edtl); + if (err) + goto send_command_error; + } + + iser_reg_single(p_iser_conn->ib_conn->p_adaptor, + p_send_dto->regd[0], DMA_TO_DEVICE); + + if (iser_post_receive_control(p_iser_conn) != 0) { + iser_err("post_recv failed!\n"); + err = -ENOMEM; + goto send_command_error; + } + + p_ctask->status = ISER_TASK_STATUS_STARTED; + + err = iser_post_send(&p_ctask->desc); + if (!err) + return 0; + +send_command_error: + if (p_send_dto != NULL) + iser_dto_buffs_release(p_send_dto); + iser_err("conn %p failed err %d\n",p_iser_conn, err); + return err; +} + +/** + * iser_send_data_out - send data out PDU + */ +int iser_send_data_out(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_cmd_task *p_ctask, + struct iscsi_data *hdr) +{ + struct iser_desc *tx_desc = NULL; + struct iser_dto *p_send_dto = NULL; + unsigned long buf_offset; + unsigned long data_seg_len; + unsigned int itt; + int err = 0; + + if (atomic_read(&p_iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", p_iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(p_iser_conn, p_ctask)) + return -EAGAIN; + + itt = ntohl(hdr->itt); + data_seg_len = ntoh24(hdr->dlength); + buf_offset = ntohl(hdr->offset); + + iser_dbg("%s itt %d dseg_len %d offset %d\n", + __func__,(int)itt,(int)data_seg_len,(int)buf_offset); + + tx_desc = kmem_cache_alloc(ig.desc_cache, GFP_KERNEL | __GFP_NOFAIL); + if (tx_desc == NULL) { + iser_err("Failed to alloc desc for post dataout\n"); + err = -ENOMEM; + goto send_data_out_error; + } + + tx_desc->type = ISCSI_TX_DATAOUT; + memcpy(&tx_desc->iscsi_header, hdr, sizeof(struct iscsi_hdr)); + + /* build the tx desc regd header and add it to the tx desc dto */ + p_send_dto = &tx_desc->dto; + p_send_dto->p_task = p_ctask; + iser_create_send_desc(p_iser_conn, tx_desc); + + iser_reg_single(p_iser_conn->ib_conn->p_adaptor, + p_send_dto->regd[0], DMA_TO_DEVICE); + + /* all data was registered 
for RDMA, we can use the lkey */ + iser_dto_add_regd_buff(p_send_dto, + &p_ctask->rdma_regd[ISER_DIR_OUT], + USE_OFFSET(buf_offset), + USE_SIZE(data_seg_len)); + + if (buf_offset + data_seg_len > p_ctask->data_len[ISER_DIR_OUT]) { + iser_err("Offset:%ld & DSL:%ld in Data-Out " + "inconsistent with total len:%ld, itt:%d\n", + buf_offset, data_seg_len, + p_ctask->data_len[ISER_DIR_OUT], itt); + err = -EINVAL; + goto send_data_out_error; + } + iser_dbg("data-out itt: %d, offset: %ld, sz: %ld\n", + itt, buf_offset, data_seg_len); + + + err = iser_post_send(tx_desc); + if (!err) + return 0; + +send_data_out_error: + if (p_send_dto != NULL) + iser_dto_buffs_release(p_send_dto); + if (tx_desc != NULL) + kmem_cache_free(ig.desc_cache, tx_desc); + iser_err("conn %p failed err %d\n",p_iser_conn, err); + return err; +} + +int iser_send_control(struct iscsi_iser_conn *p_iser_conn, + struct iscsi_iser_mgmt_task *p_mtask) +{ + struct iser_dto *p_send_dto = NULL; + unsigned int itt; + unsigned long data_seg_len; + int err = 0; + unsigned char opcode; + struct iser_regd_buf *p_regd_buf; + struct iser_adaptor *p_iser_adaptor; + + if (atomic_read(&p_iser_conn->ib_conn->state) != ISER_CONN_UP) { + iser_err("Failed to send, conn: 0x%p is not up\n", p_iser_conn->ib_conn); + return -EPERM; + } + + if (iser_check_xmit(p_iser_conn,p_mtask)) + return -EAGAIN; + + /* build the tx desc regd header and add it to the tx desc dto */ + p_mtask->desc.type = ISCSI_TX_CONTROL; + p_send_dto = &p_mtask->desc.dto; + p_send_dto->p_task = NULL; + iser_create_send_desc(p_iser_conn, &p_mtask->desc); + + p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; + + iser_reg_single(p_iser_adaptor, p_send_dto->regd[0], DMA_TO_DEVICE); + + itt = ntohl(p_mtask->hdr->itt); + opcode = p_mtask->hdr->opcode & ISCSI_OPCODE_MASK; + data_seg_len = ntoh24(p_mtask->hdr->dlength); + + if (data_seg_len > 0) { + p_regd_buf = &p_mtask->desc.data_regd_buf; + memset(p_regd_buf, 0, sizeof(struct iser_regd_buf)); + 
p_regd_buf->p_adaptor = p_iser_adaptor; + p_regd_buf->virt_addr = p_mtask->data; + p_regd_buf->data_size = p_mtask->data_count; + iser_reg_single(p_iser_adaptor, p_regd_buf, + DMA_TO_DEVICE); + iser_dto_add_regd_buff(p_send_dto, p_regd_buf, + USE_NO_OFFSET, + USE_SIZE(data_seg_len)); + } + + if (iser_post_receive_control(p_iser_conn) != 0) { + iser_err("post_rcv_buff failed!\n"); + err = -ENOMEM; + goto send_control_error; + } + + err = iser_post_send(&p_mtask->desc); + if (!err) + return 0; + +send_control_error: + if (p_send_dto != NULL) + iser_dto_buffs_release(p_send_dto); + iser_err("conn %p failed err %d\n",p_iser_conn, err); + return err; +} + +/** + * iser_rcv_dto_completion - recv DTO completion + */ +void iser_rcv_completion(struct iser_desc *p_rx_desc, + unsigned long dto_xfer_len) +{ + struct iscsi_iser_session *p_session; + struct iser_dto *p_dto = &p_rx_desc->dto; + struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; + struct iscsi_iser_cmd_task *p_iser_task = NULL; + struct iscsi_hdr *p_hdr; + char *rx_data = NULL; + int rc, rx_data_size = 0; + unsigned int itt; + unsigned char opcode; + + p_hdr = &p_rx_desc->iscsi_header; + + iser_dbg("op 0x%x itt 0x%x\n", p_hdr->opcode,p_hdr->itt); + + if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data */ + rx_data_size = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; + rx_data = p_dto->regd[1]->virt_addr; + rx_data += p_dto->offset[1]; + } + + opcode = p_hdr->opcode & ISCSI_OPCODE_MASK; + + if (opcode == ISCSI_OP_SCSI_CMD_RSP) { + p_session = p_iser_conn->session; + itt = p_hdr->itt; + if (!(itt < p_session->cmds_max)) + iser_bug("itt can't be matched to task!!!" 
+ "conn %p opcode %d cmds_max %d itt %d\n", + p_iser_conn,opcode,p_session->cmds_max,itt); + /* use the mapping given with the cmds array indexed by itt */ + p_iser_task = (struct iscsi_iser_cmd_task *)p_session->cmds[itt]; + iser_dbg("itt %d p_iser_task %p\n",itt,p_iser_task); + if (p_iser_task != NULL) { + if (p_iser_task->data_copy[ISER_DIR_IN].p_buf != NULL || + p_iser_task->data_copy[ISER_DIR_OUT].p_buf != NULL) + /* if we were reading, copy back to unaligned * + * sglist, anyway dma_unmap and free the copy */ + iser_finalize_rdma_unaligned_sg(p_iser_task); + + p_iser_task->status = ISER_TASK_STATUS_COMPLETED; + iser_ctask_rdma_finalize(p_iser_task); + } + } + + rc = iscsi_iser_hdr_recv(p_iser_conn, p_hdr, rx_data); + if (rc) + iscsi_iser_conn_failure(p_iser_conn, rc); + + iser_dto_buffs_release(p_dto); + kfree(p_rx_desc->data); + kmem_cache_free(ig.desc_cache, p_rx_desc); + + /* decrementing conn->post_recv_buf_count only --after-- freeing the * + * task eliminates the need to worry on tasks which are completed in * + * parallel to the execution of iser_conn_term. 
So the code that waits * + * for the posted rx bufs refcount to become zero handles everything */ + atomic_dec(&p_iser_conn->ib_conn->post_recv_buf_count); +} + +void iser_snd_completion(struct iser_desc *p_tx_desc) +{ + struct iser_dto *p_dto = &p_tx_desc->dto; + struct iscsi_iser_conn *p_iser_conn = p_dto->p_conn; + + iser_dbg("Initiator, Data sent p_dto=0x%p\n", p_dto); + + iser_dto_buffs_release(p_dto); + + if (p_tx_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, p_tx_desc); + + atomic_dec(&p_iser_conn->ib_conn->post_send_buf_count); + + spin_lock(&p_iser_conn->lock); + if (p_iser_conn->suspend_tx) { + iser_dbg("%ld resuming tx\n",jiffies); + clear_bit(SUSPEND_BIT, &p_iser_conn->suspend_tx); + schedule_work(&p_iser_conn->xmitwork); + } + spin_unlock(&p_iser_conn->lock); +} + +void iser_ctask_rdma_init(struct iscsi_iser_cmd_task *p_iser_task) + +{ + p_iser_task->status = ISER_TASK_STATUS_INIT; + + p_iser_task->dir[ISER_DIR_IN] = 0; + p_iser_task->dir[ISER_DIR_OUT] = 0; + + p_iser_task->data_len[ISER_DIR_IN] = 0; + p_iser_task->data_len[ISER_DIR_OUT] = 0; + + memset(&p_iser_task->rdma_regd[ISER_DIR_IN], 0, + sizeof(struct iser_regd_buf)); + memset(&p_iser_task->rdma_regd[ISER_DIR_OUT], 0, + sizeof(struct iser_regd_buf)); +} + +void iser_ctask_rdma_finalize(struct iscsi_iser_cmd_task *p_iser_task) +{ + int deferred; + + if (p_iser_task->dir[ISER_DIR_IN]) { + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_IN]); + if (deferred) + iser_bug("References remain for BUF-IN rdma reg\n"); + } + + if (p_iser_task->dir[ISER_DIR_OUT]) { + deferred = iser_regd_buff_release + (&p_iser_task->rdma_regd[ISER_DIR_OUT]); + if (deferred) + iser_bug("References remain for BUF-OUT rdma reg\n"); + } + + iser_dma_unmap_task_data(p_iser_task); +} + +void iser_dto_buffs_release(struct iser_dto *p_dto) +{ + int i; + + for (i = 0; i < p_dto->regd_vector_len; i++) + iser_regd_buff_release(p_dto->regd[i]); +} From ogerlitz at voltaire.com Wed Feb 22 
06:34:47 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 22 Feb 2006 16:34:47 +0200 (IST) Subject: [openib-general] [PATCH 4/6] [RFC] iser cma and verbs interaction In-Reply-To: Message-ID: --- /ulp/iser-x/iser_verbs.c 2006-02-22 15:06:59.000000000 +0200 +++ /ulp/iser/iser_verbs.c 2006-02-22 13:48:55.000000000 +0200 @@ -1 +1,784 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005, 2006 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_verbs.c 5459 2006-02-22 11:00:48Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include "iscsi_iser.h" +#include "iser_socket.h" + +#define ISCSI_ISER_MAX_CONN 8 +#define ISER_MAX_CQ_LEN ((ISER_QP_MAX_RECV_DTOS + \ + ISER_QP_MAX_REQ_DTOS) * \ + ISCSI_ISER_MAX_CONN) + +static void iser_cq_tasklet_fn(unsigned long data); +static void iser_cq_callback(struct ib_cq *cq, void *cq_context); +static void iser_comp_error_worker(void *data); +static void iser_conn_release(struct iser_conn *p_iser_conn); + +static void iser_cq_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got cq event %d \n", cause->event); +} + +static void iser_qp_event_callback(struct ib_event *cause, void *context) +{ + iser_err("got qp event %d\n",cause->event); +} + +/** + * iser_create_adaptor_ib_res - creates Protection Domain (PD), Completion + * Queue (CQ), DMA Memory Region (DMA MR) with the device associated with + * the adapator. + * + * returns 0 on success, -1 on failure + */ +static int iser_create_adaptor_ib_res(struct iser_adaptor *p_iser_adaptor) +{ + struct ib_device *device = p_iser_adaptor->device; + + strcpy(p_iser_adaptor->name, device->name); + iser_dbg("setting device name %s as adaptor name\n", device->name); + + p_iser_adaptor->pd = ib_alloc_pd(device); + if (IS_ERR(p_iser_adaptor->pd)) + goto pd_err; + + p_iser_adaptor->cq = ib_create_cq(device, + iser_cq_callback, + iser_cq_event_callback, + (void *)p_iser_adaptor, + ISER_MAX_CQ_LEN); + if (IS_ERR(p_iser_adaptor->cq)) + goto cq_err; + + if (ib_req_notify_cq(p_iser_adaptor->cq, IB_CQ_NEXT_COMP)) + goto cq_arm_err; + + tasklet_init(&p_iser_adaptor->cq_tasklet, + iser_cq_tasklet_fn, + (unsigned long)p_iser_adaptor); + + p_iser_adaptor->mr = ib_get_dma_mr(p_iser_adaptor->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(p_iser_adaptor->mr)) + goto dma_mr_err; + + return 0; + +dma_mr_err: + tasklet_kill(&p_iser_adaptor->cq_tasklet); +cq_arm_err: + 
 ib_destroy_cq(p_iser_adaptor->cq); +cq_err: + ib_dealloc_pd(p_iser_adaptor->pd); +pd_err: + iser_err("failed to allocate an IB resource\n"); + return -1; +} + +/** + * iser_free_adaptor_ib_res - destory/dealloc/dereg the DMA MR, + * CQ and PD created with the device associated with the adapator. + * + * returns 0 on success, -1 on failure + */ +static int iser_free_adaptor_ib_res(struct iser_adaptor *p_iser_adaptor) +{ + BUG_ON(p_iser_adaptor->mr == NULL); + + tasklet_kill(&p_iser_adaptor->cq_tasklet); + + (void)ib_dereg_mr(p_iser_adaptor->mr); + (void)ib_destroy_cq(p_iser_adaptor->cq); + (void)ib_dealloc_pd(p_iser_adaptor->pd); + + p_iser_adaptor->mr = NULL; + p_iser_adaptor->cq = NULL; + p_iser_adaptor->pd = NULL; + return 0; +} + +/** + * iser_create_ib_conn_res - Creates FMR pool and Queue-Pair (QP) + * + * returns 0 on success, -1 on failure + */ +static int iser_create_ib_conn_res(struct iser_conn *p_iser_conn) +{ + struct iser_adaptor *p_iser_adaptor; + struct ib_qp_init_attr init_attr; + int ret; + struct ib_fmr_pool_param params; + + BUG_ON(p_iser_conn->p_adaptor == NULL); + + p_iser_adaptor = p_iser_conn->p_adaptor; + + params.page_shift = PAGE_SHIFT; + /* when the first/last SG element are not start/end * + * page aligned, the map whould be of N+1 pages */ + params.max_pages_per_fmr = ISCSI_ISER_SG_TABLESIZE + 1; + params.pool_size = ISCSI_ISER_XMIT_CMDS_MAX; + params.dirty_watermark = 32; + params.cache = 0; + params.flush_function = NULL; + params.access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); + + p_iser_conn->fmr_pool = ib_create_fmr_pool(p_iser_adaptor->pd, &params); + if (IS_ERR(p_iser_conn->fmr_pool)) { + ret = PTR_ERR(p_iser_conn->fmr_pool); + goto fmr_pool_err; + } + + memset(&init_attr, 0, sizeof init_attr); + + init_attr.event_handler = iser_qp_event_callback; + init_attr.qp_context = (void *)p_iser_conn; + init_attr.send_cq = p_iser_adaptor->cq; + init_attr.recv_cq = p_iser_adaptor->cq; + 
init_attr.cap.max_send_wr = ISER_QP_MAX_REQ_DTOS; + init_attr.cap.max_recv_wr = ISER_QP_MAX_RECV_DTOS; + init_attr.cap.max_send_sge = MAX_REGD_BUF_VECTOR_LEN; + init_attr.cap.max_recv_sge = 2; + init_attr.sq_sig_type = IB_SIGNAL_REQ_WR; + init_attr.qp_type = IB_QPT_RC; + + ret = rdma_create_qp(p_iser_conn->cma_id, p_iser_adaptor->pd, &init_attr); + if (ret) + goto qp_err; + + p_iser_conn->qp = p_iser_conn->cma_id->qp; + iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n", + p_iser_conn, p_iser_conn->cma_id, + p_iser_conn->fmr_pool, p_iser_conn->cma_id->qp); + return ret; + +qp_err: + (void)ib_destroy_fmr_pool(p_iser_conn->fmr_pool); +fmr_pool_err: + iser_err("unable to create fmr pool or qp for ib_conn: %d\n", ret); + return ret; +} + +/** + * iser_free_ib_conn_res - Releases the FMR pool, QP and CMA ID objects + * returns 0 on success, -1 on failure + */ +static int iser_free_ib_conn_res(struct iser_conn *p_iser_conn) +{ + BUG_ON(p_iser_conn == NULL); + + iser_err("freeing conn %p cma_id %p fmr pool %p qp %p\n", + p_iser_conn, p_iser_conn->cma_id, + p_iser_conn->fmr_pool, p_iser_conn->qp); + + /* qp is created only once both addr & route are resolved */ + if (p_iser_conn->fmr_pool != NULL) + ib_destroy_fmr_pool(p_iser_conn->fmr_pool); + + if (p_iser_conn->qp != NULL) + rdma_destroy_qp(p_iser_conn->cma_id); + + if (p_iser_conn->cma_id != NULL) + rdma_destroy_id(p_iser_conn->cma_id); + else + iser_bug("not supposed to be called twice\n"); + + p_iser_conn->fmr_pool = NULL; + p_iser_conn->qp = NULL; + p_iser_conn->cma_id = NULL; + + return 0; +} + +/** + * based on the resolved device node GUID see if there already allocated + * adaptor for this device. If there's no such, create one. 
+ */ +static +struct iser_adaptor *iser_adaptor_find_by_device(struct rdma_cm_id *cma_id) +{ + struct list_head *p_list; + struct iser_adaptor *p_adaptor = NULL; + + mutex_lock(&ig.adaptor_list_mutex); + + p_list = ig.adaptor_list.next; + while (p_list != &ig.adaptor_list) { + p_adaptor = list_entry(p_list, struct iser_adaptor, ig_list); + /* find if there's a match using the node GUID */ + if (p_adaptor->device->node_guid == cma_id->device->node_guid) + break; + } + + if (p_adaptor == NULL) { + p_adaptor = kzalloc(sizeof *p_adaptor, GFP_KERNEL); + if (p_adaptor == NULL) + goto end; + /* assign this device to the adaptor */ + p_adaptor->device = cma_id->device; + /* init the adaptor and link it into ig adaptor list */ + if (iser_create_adaptor_ib_res(p_adaptor)) { + kfree(p_adaptor); + p_adaptor = NULL; + goto end; + } + list_add(&p_adaptor->ig_list, &ig.adaptor_list); + } +end: + BUG_ON(p_adaptor == NULL); + p_adaptor->refcount++; + mutex_unlock(&ig.adaptor_list_mutex); + return p_adaptor; +} + +/* if there's no demand for this adaptor, release it */ +static void iser_adaptor_try_release(struct iser_adaptor *p_adaptor) +{ + mutex_lock(&ig.adaptor_list_mutex); + p_adaptor->refcount--; + iser_err("adaptor %p refcount %d\n",p_adaptor,p_adaptor->refcount); + if (!p_adaptor->refcount) { + iser_free_adaptor_ib_res(p_adaptor); + list_del(&p_adaptor->ig_list); + kfree(p_adaptor); + } + mutex_unlock(&ig.adaptor_list_mutex); +} + +/** + * iser_conn_terminate - Triggers start of the disconnect procedures and wait + * for them to be done + */ +void iser_conn_terminate(struct iser_conn *ib_conn) +{ + int err = 0; + + atomic_set(&ib_conn->state, ISER_CONN_TERMINATING); + err = rdma_disconnect(ib_conn->cma_id); + if (err) + iser_bug("Failed to disconnect, conn: 0x%p err %d\n",ib_conn,err); + wait_event_interruptible(ib_conn->wait, + (atomic_read(&ib_conn->state) == ISER_CONN_DOWN)); + iser_conn_release(ib_conn); +} + +static void iser_connect_error(struct rdma_cm_id *cma_id) +{ 
+ struct iser_conn *p_iser_conn; + p_iser_conn = (struct iser_conn *)cma_id->context; + + if (atomic_read(&p_iser_conn->state) == ISER_CONN_PENDING) { + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&p_iser_conn->wait); + } else + iser_err("Unexpected evt for conn.state: %d\n", + atomic_read(&p_iser_conn->state)); +} + +static void iser_addr_handler(struct rdma_cm_id *cma_id) +{ + struct iser_adaptor *p_iser_adaptor; + struct iser_conn *p_iser_conn; + int ret; + + p_iser_adaptor = iser_adaptor_find_by_device(cma_id); + p_iser_conn = (struct iser_conn *)cma_id->context; + p_iser_conn->p_adaptor = p_iser_adaptor; + + ret = rdma_resolve_route(cma_id, 1000); + if (ret) { + iser_err("resolve route failed: %d\n", ret); + iser_connect_error(cma_id); + } + return; +} + +static void iser_route_handler(struct rdma_cm_id *cma_id) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = iser_create_ib_conn_res((struct iser_conn *)cma_id->context); + if (ret) + goto failure; + + iser_dbg("path.mtu is %d setting it to %d\n", + cma_id->route.path_rec->mtu, IB_MTU_1024); + + /* we must set the MTU to 1024 as this is what the target is assuming */ + if (cma_id->route.path_rec->mtu > IB_MTU_1024) + cma_id->route.path_rec->mtu = IB_MTU_1024; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 4; + conn_param.initiator_depth = 1; + conn_param.retry_count = 7; + conn_param.rnr_retry_count = 6; + + ret = rdma_connect(cma_id, &conn_param); + if (ret) { + iser_err("failure connecting: %d\n", ret); + goto failure; + } + + return; +failure: + iser_connect_error(cma_id); +} + +static void iser_connected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *p_iser_conn; + + p_iser_conn = (struct iser_conn *)cma_id->context; + atomic_set(&p_iser_conn->state, ISER_CONN_UP); + wake_up_interruptible(&p_iser_conn->wait); +} + +static void iser_disconnected_handler(struct rdma_cm_id *cma_id) +{ + struct iser_conn *p_iser_conn; + + 
p_iser_conn = (struct iser_conn *)cma_id->context; + p_iser_conn->disc_evt_flag = 1; + + /* If this event is unsolicited this means that the conn is being */ + /* terminated asynchronously from the iSCSI layer's perspective. */ + if (atomic_read(&p_iser_conn->state) == ISER_CONN_PENDING) { + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&p_iser_conn->wait); + } else { + if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) { + atomic_set(&p_iser_conn->state, ISER_CONN_TERMINATING); + iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + /* Complete the termination process if no posts are pending */ + if ((atomic_read(&p_iser_conn->post_recv_buf_count) == 0) && + (atomic_read(&p_iser_conn->post_send_buf_count) == 0)) { + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&p_iser_conn->wait); + } + } +} + +static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id); + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + iser_addr_handler(cma_id); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + iser_route_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + iser_connected_handler(cma_id); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + iser_err("event: %d, error: %d\n", event->event, event->status); + iser_connect_error(cma_id); + break; + case RDMA_CM_EVENT_DISCONNECTED: + iser_disconnected_handler(cma_id); + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + iser_bug("device removal is not handled yet\n"); + break; + case RDMA_CM_EVENT_CONNECT_RESPONSE: + iser_bug("not expecting cma to deliver the REP!!!\n"); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + default: + break; + } + return ret; +} + +void iser_conn_init(struct 
iser_conn *p_iser_conn) +{ + memset(p_iser_conn, 0, sizeof(struct iser_conn)); + atomic_set(&p_iser_conn->state, ISER_CONN_INIT); + init_waitqueue_head(&p_iser_conn->wait); + atomic_set(&p_iser_conn->post_recv_buf_count, 0); + atomic_set(&p_iser_conn->post_send_buf_count, 0); + INIT_WORK(&p_iser_conn->comperror_work, iser_comp_error_worker, + p_iser_conn); +} + + /** + * starts the process of connecting to the target + * sleeps untill the connection is established or rejected + */ +int iser_connect(struct iser_conn *p_iser_conn, + struct sockaddr_in *src_addr, + struct sockaddr_in *dst_addr) +{ + struct sockaddr *src, *dst; + int err = 0; + + sprintf(p_iser_conn->name,"%d.%d.%d.%d:%d", + NIPQUAD(dst_addr->sin_addr.s_addr), dst_addr->sin_port); + + /* the adaptor is known only --after-- address resolution */ + p_iser_conn->p_adaptor = NULL; + + iser_err("connecting to: %d.%d.%d.%d, port 0x%x\n", + NIPQUAD(dst_addr->sin_addr), dst_addr->sin_port); + + atomic_set(&p_iser_conn->state, ISER_CONN_PENDING); + + p_iser_conn->cma_id = rdma_create_id(iser_cma_handler, + (void *)p_iser_conn, + RDMA_PS_TCP); + if (IS_ERR(p_iser_conn->cma_id)) { + err = PTR_ERR(p_iser_conn->cma_id); + iser_err("rdma_create_id failed: %d\n", err); + goto connect_failure; + } + + src = (struct sockaddr *)src_addr; + dst = (struct sockaddr *)dst_addr; + err = rdma_resolve_addr(p_iser_conn->cma_id, src, dst, 1000); + if (err) { + iser_err("rdma_resolve_addr failed: %d\n", err); + rdma_destroy_id(p_iser_conn->cma_id); + goto connect_failure; + } + + wait_event_interruptible(p_iser_conn->wait, + atomic_read(&p_iser_conn->state) != ISER_CONN_PENDING); + + if (atomic_read(&p_iser_conn->state) != ISER_CONN_UP) { + iser_conn_release(p_iser_conn); + err = -EIO; + goto connect_failure; + } + return 0; + +connect_failure: + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); + return err; +} + + +/** + * Frees all conn objects and deallocs conn descriptor + */ +static void iser_conn_release(struct iser_conn 
*p_iser_conn) +{ + struct iser_adaptor *p_iser_adaptor = p_iser_conn->p_adaptor; + + if (atomic_read(&p_iser_conn->state) == ISER_CONN_DOWN) { + iser_free_ib_conn_res(p_iser_conn); /* qp/id freed only once */ + p_iser_conn->p_adaptor = NULL; + /* on EVENT_ADDR_ERROR there's no adaptor yet for this conn */ + if (p_iser_adaptor != NULL) + iser_adaptor_try_release(p_iser_adaptor); + } else + iser_err("conn %p state is %d doing nothing\n", + p_iser_conn,atomic_read(&p_iser_conn->state)); +} + + +/** + * iser_reg_page_vec - Register physical memory + * + * returns: 0 on success, errno code on failure + */ +int iser_reg_page_vec(struct iser_conn *p_iser_conn, + struct iser_page_vec *page_vec, + struct iser_mem_reg *mem_reg) +{ + struct ib_pool_fmr *mem; + u64 io_addr; + u64 *page_list; + int status; + + page_list = page_vec->pages; + io_addr = page_list[0]; + + mem = ib_fmr_pool_map_phys(p_iser_conn->fmr_pool, + page_list, + page_vec->length, + &io_addr); + + if (IS_ERR(mem)) { + status = (int)PTR_ERR(mem); + iser_err("ib_fmr_pool_map_phys failed: %d\n", status); + return status; + } + + mem_reg->lkey = mem->fmr->lkey; + mem_reg->rkey = mem->fmr->rkey; + mem_reg->len = page_vec->length * PAGE_SIZE; + mem_reg->va = io_addr; + mem_reg->mem_h = (void *)mem; + + mem_reg->va += page_vec->offset; + mem_reg->len = page_vec->data_size; + + iser_dbg("PHYSICAL Mem.register, [PHYS p_array: 0x%p, sz: %d, " + "entry[0]: (0x%08lx,%ld)] -> " + "[lkey: 0x%08X mem_h: 0x%p va: 0x%08lX sz: %ld]\n", + page_vec, page_vec->length, + (unsigned long)page_vec->pages[0], + (unsigned long)page_vec->data_size, + (unsigned int)mem_reg->lkey, mem_reg->mem_h, + (unsigned long)mem_reg->va, (unsigned long)mem_reg->len); + return 0; +} + +/** + * Unregister (previosuly registered) memory. 
+ */ +void iser_unreg_mem(struct iser_mem_reg *reg) +{ + int ret; + + iser_dbg("PHYSICAL Mem.Unregister mem_h %p\n",reg->mem_h); + + ret = ib_fmr_pool_unmap((struct ib_pool_fmr *)reg->mem_h); + if (ret) + iser_err("ib_fmr_pool_unmap failed %d\n", ret); + + reg->mem_h = NULL; +} + +/** + * iser_dto_to_iov - builds IOV from a dto descriptor + */ +static void iser_dto_to_iov(struct iser_dto *p_dto, struct ib_sge *iov, int iov_len) +{ + int i; + struct ib_sge *sge; + struct iser_regd_buf *p_regd_buf; + + if (p_dto->regd_vector_len > iov_len) + iser_bug("iov size %d too small for posting dto of len %d\n", + iov_len, p_dto->regd_vector_len); + + for (i = 0; i < p_dto->regd_vector_len; i++) { + sge = &iov[i]; + p_regd_buf = p_dto->regd[i]; + + sge->addr = p_regd_buf->reg.va; + sge->length = p_regd_buf->reg.len; + sge->lkey = p_regd_buf->reg.lkey; + + if (p_dto->used_sz[i] > 0) /* Adjust size */ + sge->length = p_dto->used_sz[i]; + + /* offset and length should not exceed the regd buf length */ + if (sge->length + p_dto->offset[i] > p_regd_buf->reg.len) { + iser_bug("Used len:%ld + offset:%d, exceed reg.buf.len:" + "%ld in dto:0x%p [%d], va:0x%08lX\n", + (unsigned long)sge->length, p_dto->offset[i], + (unsigned long)p_regd_buf->reg.len, p_dto, i, + (unsigned long)sge->addr); + } + + sge->addr += p_dto->offset[i]; /* Adjust offset */ + } +} + +/** + * iser_post_recv - Posts a receive buffer. 
+ * + * returns 0 on success, -1 on failure + */ +int iser_post_recv(struct iser_desc *p_rx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_recv_wr recv_wr, *recv_wr_failed; + struct ib_sge iov[2]; + struct iscsi_iser_conn *p_iser_conn; + struct iser_dto *p_recv_dto = &p_rx_desc->dto; + + /* Retrieve conn */ + p_iser_conn = p_recv_dto->p_conn; + if (p_iser_conn == NULL) + iser_bug("NULL p_conn in dto: 0x%p\n", p_recv_dto); + + iser_dto_to_iov(p_recv_dto, iov, 2); + + recv_wr.next = NULL; + recv_wr.sg_list = iov; + recv_wr.num_sge = p_recv_dto->regd_vector_len; + recv_wr.wr_id = (unsigned long)p_rx_desc; + + atomic_inc(&p_iser_conn->ib_conn->post_recv_buf_count); + ib_ret = ib_post_recv (p_iser_conn->ib_conn->qp, &recv_wr, &recv_wr_failed); + if (ib_ret) { + iser_err("ib_post_recv failed ret=%d\n", ib_ret); + atomic_dec(&p_iser_conn->ib_conn->post_recv_buf_count); + ret_val = -1; + } + + return ret_val; +} + +/** + * iser_start_send - Initiate a Send DTO operation + * + * returns 0 on success, -1 on failure + */ +int iser_post_send(struct iser_desc *p_tx_desc) +{ + int ib_ret, ret_val = 0; + struct ib_send_wr send_wr, *send_wr_failed; + struct ib_sge iov[MAX_REGD_BUF_VECTOR_LEN]; + struct iscsi_iser_conn *p_iser_conn; + struct iser_dto *p_dto = &p_tx_desc->dto; + + p_iser_conn = p_dto->p_conn; + if (p_iser_conn == NULL) + iser_bug("NULL p_conn in dto: 0x%p\n", p_dto); + + iser_dto_to_iov(p_dto, iov, MAX_REGD_BUF_VECTOR_LEN); + + send_wr.next = NULL; + send_wr.wr_id = (unsigned long)p_tx_desc; + send_wr.sg_list = iov; + send_wr.num_sge = p_dto->regd_vector_len; + send_wr.opcode = IB_WR_SEND; + send_wr.send_flags = p_dto->notify_enable ? 
IB_SEND_SIGNALED : 0; + + atomic_inc(&p_iser_conn->ib_conn->post_send_buf_count); + + ib_ret = ib_post_send(p_iser_conn->ib_conn->qp, &send_wr, &send_wr_failed); + if (ib_ret) { + iser_err("Failed to start SEND DTO, p_dto: 0x%p, IOV len: %d\n", + p_dto, p_dto->regd_vector_len); + iser_err("ib_post_send failed, ret:%d\n", ib_ret); + atomic_dec(&p_iser_conn->ib_conn->post_send_buf_count); + ret_val = -1; + } + + return ret_val; +} + +static void iser_comp_error_worker(void *data) +{ + struct iser_conn *p_iser_conn = data; + + if (atomic_read(&p_iser_conn->state) == ISER_CONN_UP) { + atomic_set(&p_iser_conn->state, ISER_CONN_TERMINATING); + iscsi_iser_conn_failure(p_iser_conn->p_iscsi_conn, + ISCSI_ERR_CONN_FAILED); + } + + /* complete the termination process if disconnect event was delivered * + * note there are no more non completed posts to the QP */ + if (p_iser_conn->disc_evt_flag) { + atomic_set(&p_iser_conn->state, ISER_CONN_DOWN); + wake_up_interruptible(&p_iser_conn->wait); + } +} + +static void iser_handle_comp_error(struct iser_desc *p_desc) +{ + struct iser_dto *p_dto = &p_desc->dto; + struct iser_conn *p_iser_conn = p_dto->p_conn->ib_conn; + + iser_dto_buffs_release(p_dto); + + if (p_desc->type == ISCSI_RX) { + kfree(p_desc->data); + kmem_cache_free(ig.desc_cache, p_desc); + atomic_dec(&p_iser_conn->post_recv_buf_count); + } else { /* type is TX control/command/dataout */ + if (p_desc->type == ISCSI_TX_DATAOUT) + kmem_cache_free(ig.desc_cache, p_desc); + atomic_dec(&p_iser_conn->post_send_buf_count); + } + + if (atomic_read(&p_iser_conn->post_recv_buf_count) == 0 && + atomic_read(&p_iser_conn->post_send_buf_count) == 0) + schedule_work(&p_iser_conn->comperror_work); +} + +static void iser_cq_tasklet_fn(unsigned long data) +{ + struct iser_adaptor *p_iser_adaptor = (struct iser_adaptor *)data; + struct ib_cq *cq = p_iser_adaptor->cq; + struct ib_wc wc; + struct iser_desc *p_desc; + unsigned long xfer_len; + + + while (ib_poll_cq(cq, 1, &wc) == 1) { + 
 p_desc = (struct iser_desc *) (unsigned long) wc.wr_id; + + if (p_desc == NULL) + iser_bug("NULL p_desc\n"); + + if (wc.status == IB_WC_SUCCESS) { + if (p_desc->type == ISCSI_RX) { + xfer_len = (unsigned long)wc.byte_len; + iser_rcv_completion(p_desc, xfer_len); + } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ + iser_snd_completion(p_desc); + } else { + iser_err("comp w. error op %d status %d\n",p_desc->type,wc.status); + iser_handle_comp_error(p_desc); + } + } +/* #warning "it is assumed here that arming CQ only once its empty would not" + * "cause interrupts to be missed" */ + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); +} + +static void iser_cq_callback(struct ib_cq *cq, void *cq_context) +{ + struct iser_adaptor *p_iser_adaptor = (struct iser_adaptor *)cq_context; + + tasklet_schedule(&p_iser_adaptor->cq_tasklet); +} From ogerlitz at voltaire.com Wed Feb 22 06:35:59 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 22 Feb 2006 16:35:59 +0200 (IST) Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: Message-ID: + the code is able to handle SG lists which are not aligned for RDMA, where an aligned list is one for which a single VA and RKEY pair can be produced by any of the ib verbs memory registration apis. + from our experience such lists are very rare, and over time less than 0.1% of the data sent down by the SCSI ML is represented by such SGs + the unaligned SG flow needs to be fixed such that dma mapping/unmapping takes place after/before the CPU last/first touches the data + one planned change here is to always convert SGs to a page vector of 4K elements no matter what the system PAGE_SIZE is. This is expected towards 2.6.17, with the merging of the change in the ib_fmr_pool api --- /ulp/iser-x/iser_memory.c 2006-02-22 15:06:53.000000000 +0200 +++ /ulp/iser/iser_memory.c 2006-02-22 13:48:55.000000000 +0200 @@ -1 +1,491 @@ +/* + * Copyright (c) 2004, 2005, 2006 Voltaire, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: iser_memory.c 5459 2006-02-22 11:00:48Z ogerlitz $ + */ +#include +#include +#include +#include +#include +#include +#include "iscsi_iser.h" + +#define ISER_KMALLOC_THRESHOLD 0x20000 /* 128K - kmalloc limit */ +/** + * Decrements the reference count for the + * registered buffer & releases it + * + * returns 0 if released, 1 if deferred + */ +int iser_regd_buff_release(struct iser_regd_buf *p_regd_buf) +{ + struct device *dma_device; + + if ((atomic_read(&p_regd_buf->ref_count) == 0) || + atomic_dec_and_test(&p_regd_buf->ref_count)) { + /* if we used the dma mr, unreg is just NOP */ + if (p_regd_buf->reg.rkey != 0) + iser_unreg_mem(&p_regd_buf->reg); + + if (p_regd_buf->dma_addr) { + dma_device = p_regd_buf->p_adaptor->device->dma_device; + dma_unmap_single(dma_device, + p_regd_buf->dma_addr, + p_regd_buf->data_size, + p_regd_buf->direction); + } + /* else this regd buf is associated with task which we */ + /* dma_unmap_single/sg later */ + return 0; + } else { + iser_dbg("Release deferred, regd.buff: 0x%p\n", p_regd_buf); + return 1; + } +} + +/** + * iser_reg_single - fills registered buffer descriptor with + * registration information + */ +void iser_reg_single(struct iser_adaptor *p_iser_adaptor, + struct iser_regd_buf *p_regd_buf, + enum dma_data_direction direction) +{ + dma_addr_t dma_addr; + + dma_addr = dma_map_single(p_iser_adaptor->device->dma_device, + p_regd_buf->virt_addr, + p_regd_buf->data_size, direction); + if (dma_mapping_error(dma_addr)) + iser_bug("dma_map_single failed at %p\n", p_regd_buf->virt_addr); + + p_regd_buf->reg.lkey = p_iser_adaptor->mr->lkey; + p_regd_buf->reg.rkey = 0; /* indicate there's no need to unreg */ + p_regd_buf->reg.len = p_regd_buf->data_size; + p_regd_buf->reg.va = dma_addr; + + p_regd_buf->dma_addr = dma_addr; + p_regd_buf->direction = direction; +} + + +/** + * iser_sg_size - returns the total data length in an sg list + */ +int iser_sg_size(struct iser_data_buf *p_data) +{ + struct scatterlist *p_sg = 
(struct scatterlist *)p_data->p_buf; + int i, total_len=0; + + for (i = 0; i < p_data->dma_nents; i++) + total_len += sg_dma_len(&p_sg[i]); + return total_len; +} + +/** + * iser_start_rdma_unaligned_sg + */ +void iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task, + enum iser_data_dir cmd_dir) +{ + dma_addr_t dma_addr; + struct device *dma_device; + char *mem = NULL; + struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; + unsigned long cmd_data_len = p_iser_task->data_len[cmd_dir]; + + if (cmd_data_len > ISER_KMALLOC_THRESHOLD) + mem = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOFAIL, + long_log2(roundup_pow_of_two(cmd_data_len)) - PAGE_SHIFT); + else + mem = kmalloc(cmd_data_len, GFP_KERNEL | __GFP_NOFAIL); + + if (mem == NULL) { + iser_bug("Failed to allocate mem size %d %d for copying sglist\n", + p_mem->size,(int)cmd_data_len); + } + + if (cmd_dir == ISER_DIR_OUT) { + /* copy the unaligned sg the buffer which is used for RDMA */ + struct scatterlist *p_sg = (struct scatterlist *)p_mem->p_buf; + int i; + char *p; + + for (p = mem, i = 0; i < p_mem->size; i++) { + memcpy(p, + page_address(p_sg[i].page) + p_sg[i].offset, + p_sg[i].length); + p += p_sg[i].length; + } + } + + p_iser_task->data_copy[cmd_dir].p_buf = mem; + p_iser_task->data_copy[cmd_dir].size = cmd_data_len; + p_iser_task->data_copy[cmd_dir].type = ISER_BUF_TYPE_SINGLE; + + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + if (cmd_dir == ISER_DIR_OUT) + dma_addr = dma_map_single(dma_device, mem, cmd_data_len, + DMA_TO_DEVICE); + else + dma_addr = dma_map_single(dma_device, mem, cmd_data_len, + DMA_FROM_DEVICE); + + if (dma_mapping_error(dma_addr)) + iser_bug("dma_map_single failed at %p\n", mem); + + p_iser_task->data_copy[cmd_dir].dma_addr = dma_addr; +} + +/** + * iser_finalize_rdma_unaligned_sg + */ +void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *p_iser_task) +{ + struct device *dma_device; + struct iser_data_buf *p_mem_copy; 
+ unsigned int size; + dma_addr_t dma_addr; + + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; + + if (p_iser_task->dir[ISER_DIR_IN]) { + char *mem; + struct scatterlist *p_sg; + unsigned char *p; + unsigned int sg_size; + int i; + + p_mem_copy = &p_iser_task->data_copy[ISER_DIR_IN]; + size = p_mem_copy->size; + dma_addr = p_mem_copy->dma_addr; + + dma_unmap_single(dma_device, dma_addr, size, DMA_FROM_DEVICE); + /* copy back read RDMA to unaligned sg */ + mem = p_mem_copy->p_buf; + p_sg = (struct scatterlist *)p_iser_task->data[ISER_DIR_IN].p_buf; + sg_size = p_iser_task->data[ISER_DIR_IN].size; + + for (p = mem, i = 0; i < sg_size; i++){ + memcpy(page_address(p_sg[i].page) + p_sg[i].offset, + p, + p_sg[i].length); + p += p_sg[i].length; + } + + if (size > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)p_mem_copy->p_buf, + long_log2(roundup_pow_of_two((int)size)) - PAGE_SHIFT); + else + kfree(p_mem_copy->p_buf); + p_mem_copy->p_buf = NULL; + } + + if (p_iser_task->dir[ISER_DIR_OUT]) { + p_mem_copy = &p_iser_task->data_copy[ISER_DIR_OUT]; + size = p_mem_copy->size; + dma_addr = p_mem_copy->dma_addr; + dma_unmap_single(dma_device, dma_addr, size, DMA_TO_DEVICE); + if (size > ISER_KMALLOC_THRESHOLD) + free_pages((unsigned long)p_mem_copy->p_buf, + long_log2(roundup_pow_of_two((int)size)) - PAGE_SHIFT); + else + kfree(p_mem_copy->p_buf); + p_mem_copy->p_buf = NULL; + } +} + +/** + * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses + * and returns the length of resulting physical address array (may be less than + * the original due to possible compaction). + * + * we build a "page vec" under the assumption that the SG meets the RDMA + * alignment requirements. Other then the first and last SG elements, all + * the "internal" elements can be compacted into a list whose elements are + * dma addresses of physical pages. 
The code supports also the weird case + * where --few fragments of the same page-- are present in the SG as + * consecutive elements. Also, it handles one entry SG. + */ +static int iser_sg_to_page_vec(struct iser_data_buf *p_data, + struct iser_page_vec *page_vec) +{ + struct scatterlist *p_sg = (struct scatterlist *)p_data->p_buf; + dma_addr_t first_addr, last_addr, page; + int start_aligned, end_aligned; + unsigned int cur_page = 0; + unsigned long total_sz = 0; + int i; + + /* compute the offset of first element */ + /* FIXME page_vec->offset type should be dma_addr_t */ + page_vec->offset = (u64) p_sg[0].offset; + + for (i = 0; i < p_data->dma_nents; i++) { + total_sz += sg_dma_len(&p_sg[i]); + + first_addr = sg_dma_address(&p_sg[i]); + last_addr = first_addr + sg_dma_len(&p_sg[i]); + + start_aligned = !(first_addr & ~PAGE_MASK); + end_aligned = !(last_addr & ~PAGE_MASK); + + /* continue to collect page fragments till aligned or SG ends */ + while (!end_aligned && (i + 1 < p_data->dma_nents)) { + i++; + total_sz += sg_dma_len(&p_sg[i]); + last_addr = sg_dma_address(&p_sg[i]) + sg_dma_len(&p_sg[i]); + end_aligned = !(last_addr & ~PAGE_MASK); + } + + first_addr = first_addr & PAGE_MASK; + + for (page = first_addr; page < last_addr; page += PAGE_SIZE) + page_vec->pages[cur_page++] = page; + + } + page_vec->data_size = total_sz; + iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page); + return cur_page; +} + +/** + * iser_single_to_page_vec - + */ +static int iser_single_to_page_vec(struct iser_data_buf *p_data, + struct iser_page_vec *page_vec) +{ + u64 fpage, lpage, page; + int i; + + iser_dbg("Translating data:0x%p, single virt:0x%p, data sz: %d\n", + p_data, p_data->p_buf, p_data->size); + + fpage = (u64) p_data->dma_addr & PAGE_MASK; + lpage = (u64) (p_data->dma_addr + p_data->size - 1 + PAGE_SIZE) + & PAGE_MASK; + + page_vec->offset = (u64) (p_data->dma_addr - (long)fpage); + + for (i = 0, page = fpage; page < lpage; page += 
PAGE_SIZE, i++) { + page_vec->pages[i] = page; + iser_dbg( + "SINGLE VIRT ADDED page[%d]=0x%lX at page_vec %p\n", + i, (long)page, page_vec); + } + + page_vec->data_size = p_data->size; + iser_dbg("page_vec->data_size=%d\n", page_vec->data_size); + + return i; +} + + +#define MASK_4K ((1UL << 12) - 1) /* 0xFFF */ +#define IS_4K_ALIGNED(addr) ((((unsigned long)addr) & MASK_4K) == 0) + +/** + * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned + * for RDMA sub-list of a scatter-gather list of memory buffers, and returns + * the number of entries which are aligned correctly. Supports the case where + * consecutive SG elements are actually fragments of the same physcial page. + */ +static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *p_data) +{ + struct scatterlist *p_sg; + dma_addr_t end_addr, next_addr; + int i, cnt; + unsigned int ret_len = 0; + + p_sg = (struct scatterlist *)p_data->p_buf; + + for (cnt = 0, i = 0; i < p_data->dma_nents; i++, cnt++) { + /* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX " + "offset: %ld sz: %ld\n", i, + (unsigned long)page_to_phys(p_sg[i].page), + (unsigned long)p_sg[i].offset, + (unsigned long)p_sg[i].length); */ + end_addr = sg_dma_address(&p_sg[i]) + + sg_dma_len(&p_sg[i]); + /* iser_dbg("Checking sg iobuf end address " + "0x%08lX\n", end_addr); */ + if (i + 1 < p_data->dma_nents) { + next_addr = sg_dma_address(&p_sg[i+1]); + /* are i, i+1 fragments of the same page? 
*/ + if (end_addr == next_addr) + continue; + else if (!IS_4K_ALIGNED(end_addr)) { + ret_len = cnt + 1; + break; + } + } + } + if (i == p_data->dma_nents) + ret_len = cnt; /* loop ended */ + iser_dbg("Found %d aligned entries out of %d in sg:0x%p\n", + ret_len, p_data->dma_nents, p_data); + return ret_len; +} + +static void iser_data_buf_dump(struct iser_data_buf *p_data) +{ + if (p_data->type == ISER_BUF_TYPE_SINGLE) + iser_err("single addr:0x%p sz:%d\n", + p_data->p_buf, p_data->size); + else { + struct scatterlist *p_sg = + (struct scatterlist *)p_data->p_buf; + int i; + + for (i = 0; i < p_data->size; i++) + iser_err("sg[%d] dma_addr:0x%lX page:0x%p " + "off:%d sz:%d dma_len:%d\n", + i, (unsigned long)sg_dma_address(&p_sg[i]), + p_sg[i].page, p_sg[i].offset, + p_sg[i].length,sg_dma_len(&p_sg[i])); + } +} + +/** + * iser_page_vec_alloc - allocate page_vec covering a given data buffer + */ +static struct iser_page_vec *iser_page_vec_alloc(struct iser_data_buf *p_data, + int total_size) +{ + struct iser_page_vec *page_vec; + int npages; + + npages = total_size / PAGE_SIZE + 2; + + page_vec = kmalloc(sizeof(struct iser_page_vec) + + (sizeof(u64) * npages), + GFP_KERNEL | __GFP_NOFAIL); + if (page_vec != NULL) { + page_vec->pages = (u64 *) (page_vec + 1); + page_vec->data_size = total_size; + page_vec->length = 0; + page_vec->offset = 0; + iser_dbg("Allocated page_vec:%p, %d pages for size:%d\n", + page_vec, npages, total_size); + } else + iser_err("Failed to alloc %d pages for size:%d\n", + npages, total_size); + return page_vec; +} + + +static void iser_dump_page_vec(struct iser_page_vec *page_vec) +{ + int i; + + iser_err("page vec length %d data size %d\n", + page_vec->length, page_vec->data_size); + for (i = 0; i < page_vec->length; i++) + iser_err("%d %lx\n",i,(unsigned long)page_vec->pages[i]); +} + +static void iser_page_vec_build(struct iser_data_buf *p_data, + struct iser_page_vec *page_vec) +{ + int page_vec_len = 0; + + if (p_data->type == 
ISER_BUF_TYPE_SINGLE) { + iser_dbg("Translating single sz: %d\n", p_data->size); + page_vec_len = iser_single_to_page_vec(p_data, page_vec); + } else { + iser_dbg("Translating sg sz: %d\n", p_data->dma_nents); + page_vec_len = iser_sg_to_page_vec(p_data,page_vec); + iser_dbg("sg len %d page_vec_len %d\n", + p_data->dma_nents,page_vec_len); + } + page_vec->length = page_vec_len; + + if (page_vec_len * 4096 < page_vec->data_size) { + if (p_data->type == ISER_BUF_TYPE_SCATTERLIST) { + iser_err("dumping sg\n"); + iser_data_buf_dump(p_data); + } + iser_dump_page_vec(page_vec); + iser_bug("page_vec too short to hold this SG\n"); + } +} + +/** + * iser_reg_rdma_mem - Registers memory intended for RDMA, + * obtaining rkey and va + * + * returns 0 on success, errno code on failure + */ +int iser_reg_rdma_mem(struct iscsi_iser_cmd_task *p_iser_task, + enum iser_data_dir cmd_dir) +{ + struct iser_conn *p_iser_conn = p_iser_task->conn->ib_conn; + struct iser_data_buf *p_mem = &p_iser_task->data[cmd_dir]; + struct iser_page_vec *page_vec; + struct iser_regd_buf *p_regd_buf; + int aligned_len; + int err; + + p_regd_buf = &p_iser_task->rdma_regd[cmd_dir]; + + iser_dbg("p_mem %p p_mem->type %d\n", p_mem,p_mem->type); + + if (p_mem->type != ISER_BUF_TYPE_SINGLE) { + aligned_len = iser_data_buf_aligned_len(p_mem); + if (aligned_len != p_mem->size) { + iser_err("rdma alignment violation %d/%d aligned\n", + aligned_len, p_mem->size); + iser_data_buf_dump(p_mem); + /* allocate copy buf, if we are writing, copy the */ + /* unaligned scatterlist, dma map the copy */ + iser_start_rdma_unaligned_sg(p_iser_task, cmd_dir); + p_mem = &p_iser_task->data_copy[cmd_dir]; + } + } + + page_vec = iser_page_vec_alloc(p_mem, p_iser_task->data_len[cmd_dir]); + if(!page_vec) + return -ENOMEM; + + iser_page_vec_build(p_mem, page_vec); + err = iser_reg_page_vec(p_iser_conn,page_vec,&p_regd_buf->reg); + kfree(page_vec); + if(err) + return err; + + /* take a reference on this regd buf such that it will not 
be released * + * (eg in send dto completion) before we get the scsi response */ + atomic_inc(&p_regd_buf->ref_count); + return 0; +} From ogerlitz at voltaire.com Wed Feb 22 06:37:24 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 22 Feb 2006 16:37:24 +0200 (IST) Subject: [openib-general] [PATCH 6/6] [RFC] iser socket In-Reply-To: Message-ID: + note that data is never moved on the socket via send/recv but only by calls from iscsi_iser.c to iser_send_control/command/dataout + data originating/resulting in user space (eg login request/response) is moved down/up by open iscsi using netlink --- /ulp/iser-x/iser_socket.c 2006-02-22 15:07:03.000000000 +0200 +++ /ulp/iser/iser_socket.c 2006-02-22 13:39:07.000000000 +0200 @@ -1 +1,214 @@ +/* + * Copyright (c) 2005, 2006 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iser_socket.c 5442 2006-02-19 09:05:17Z ogerlitz $ + */ +#include +#include +#include + +#include "iscsi_iser.h" +#include "iser_socket.h" + +#define PF_ISER AF_ISER + +static int iser_sock_create(struct socket *, int); +static int iser_sock_release(struct socket *); +static int iser_sock_connect(struct socket *, struct sockaddr *, int, int); +static int iser_sock_shutdown(struct socket *,int); +static int iser_sock_getsockopt(struct socket *,int,int,char __user *,int __user *); +static unsigned int iser_sock_poll(struct file *,struct socket *, + struct poll_table_struct *); + +struct iser_sock { + struct sock sock; + struct iser_conn iser_conn; +}; + +static struct net_proto_family iser_proto_family = { + .family = PF_ISER, + .create = iser_sock_create, + .authentication = 0, + .encryption = 0, + .encrypt_net = 0 +}; + +static struct proto_ops iser_proto_ops = { + .family = AF_ISER, + .owner = THIS_MODULE, + + .connect = iser_sock_connect, + .release = iser_sock_release, + .shutdown = iser_sock_shutdown, + + .bind = sock_no_bind, + .poll = iser_sock_poll, + .socketpair = sock_no_socketpair, + .accept = sock_no_accept, + .getname = sock_no_getname, + .ioctl = sock_no_ioctl, + .listen = sock_no_listen, + .setsockopt = sock_setsockopt, + .getsockopt = iser_sock_getsockopt, + .sendmsg = sock_no_sendmsg, + .recvmsg = sock_no_recvmsg, + .mmap = sock_no_mmap, + .sendpage = sock_no_sendpage +}; + +static struct proto iser_sock_proto = { + .name = "ib_iser", + .owner = THIS_MODULE, + .obj_size = sizeof(struct iser_sock) +}; + +struct iser_conn *iser_conn_from_sock(struct socket *sock) +{ + struct iser_sock *iser_sk = (struct iser_sock *)sock->sk; + + return &iser_sk->iser_conn; +} + +struct 
socket *iser_conn_to_sock(struct iser_conn *p_iser_conn) +{ + struct iser_sock *iser_sk; + iser_sk = container_of(p_iser_conn, struct iser_sock, iser_conn); + + return iser_sk->sock.sk_socket; +} + +int iser_register_sockets(void) +{ + int error; + + error = proto_register(&iser_sock_proto, 1); + if (error < 0) { + iser_err("proto_register failed (%d)\n", error); + return error; + } + + error = sock_register(&iser_proto_family); + if (error < 0) { + iser_err("sock_register failed (%d)\n", error); + proto_unregister(&iser_sock_proto); + } + + return 0; +} + +void iser_unreg_sockets(void) +{ + sock_unregister(PF_ISER); + proto_unregister(&iser_sock_proto); +} + +static int iser_sock_create(struct socket *sock, int protocol) +{ + struct iser_sock *iser_sk = NULL; + + if (sock->type != SOCK_STREAM) + return -ESOCKTNOSUPPORT; + + iser_sk = (struct iser_sock *)sk_alloc(PF_INET, GFP_KERNEL, + &iser_sock_proto, 1); + if (iser_sk == NULL) + return -ENOBUFS; + + sock_init_data(sock, &iser_sk->sock); + iser_sk->sock.sk_destruct = NULL; + iser_sk->sock.sk_family = PF_ISER; + iser_sk->sock.sk_sndbuf = 64*1024; + + iser_conn_init(&iser_sk->iser_conn); + + sock->ops = &iser_proto_ops; + sock->state = SS_UNCONNECTED; + sock_graft(&iser_sk->sock, sock); + + return 0; +} + +int iser_sock_connect(struct socket *sock, struct sockaddr *uservaddr, + int sockaddr_len, int flags) +{ + struct sockaddr_in *dst_addr = (struct sockaddr_in *)uservaddr; + struct iser_sock *iser_sk = (struct iser_sock *)sock->sk; + struct iser_conn *p_iser_conn = &iser_sk->iser_conn; + int err = 0; + + iser_err("dst_addr ip %.8x (%d.%d.%d.%d) port %.4x=%d\n", + dst_addr->sin_addr.s_addr, NIPQUAD(dst_addr->sin_addr), + dst_addr->sin_port, dst_addr->sin_port); + + err = iser_connect(p_iser_conn, NULL, dst_addr); + if (err) + iser_err("conn_establish failed: %d\n", err); + return err; +} + +static inline void iser_sock_free(struct socket *sock) +{ + struct sock *sk = sock->sk; + sock->sk = NULL; + sock_orphan(sk); 
+ sk_free(sk); +} + +int iser_sock_release(struct socket *sock) +{ + struct iser_sock *iser_sock = (struct iser_sock *)sock->sk; + struct iser_conn *p_iser_conn = &iser_sock->iser_conn; + int iser_err = 0; + + if (atomic_read(&p_iser_conn->state) == ISER_CONN_DOWN) + iser_sock_free(sock); + else + iser_err = -EPERM; + return iser_err; +} + +int iser_sock_shutdown(struct socket *sock, int how) +{ + return 0; +} + +static int iser_sock_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + return 0; +} + +static unsigned int iser_sock_poll(struct file *file, struct socket *sock, + struct poll_table_struct *wait) +{ + return POLLOUT; +} --- /ulp/iser-x/iser_socket.h 2006-02-22 15:07:04.000000000 +0200 +++ /ulp/iser/iser_socket.h 2006-02-22 13:39:06.000000000 +0200 @@ -1 +1,49 @@ +/* + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: iser_socket.h 5314 2006-02-06 15:47:06Z ogerlitz $ + */ +#ifndef __ISER_SOCKETS_H__ +#define __ISER_SOCKETS_H__ + +#include + +struct iser_conn; + +#define AF_ISER 28 /* to be defined properly */ + +int iser_register_sockets(void); +void iser_unreg_sockets(void); + +struct iser_conn *iser_conn_from_sock(struct socket *sock); +struct socket *iser_conn_to_sock(struct iser_conn *p_iser_conn); +#endif /* __ISER_SOCKETS_H__ */ From mamidala at cse.ohio-state.edu Wed Feb 22 07:11:01 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 22 Feb 2006 10:11:01 -0500 (EST) Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <20060222011745.GL2853@greglaptop.internal.keyresearch.com> Message-ID: Hi, > * MVAPICH uses a multicast group for some MPI collectives > * This can be turned off by setting env var DISABLE_HARDWARE_MCST MVAPICH uses h/w multicast for MPI_Bcast. We plan to extend this to other collectives in future, Thanks, Amith > > * An IB multicast group has to use ports of the same speed > * This one was a surprise to me > > Ergo, when you mix 1X, 4X SDR, and 4X DDR hosts, it behaves > differently from a homogeneous network. 
> > -- greg > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at mellanox.co.il Wed Feb 22 07:50:42 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 22 Feb 2006 17:50:42 +0200 Subject: [openib-general] 1/2 libibverbs: update init attribute in create_srq Message-ID: <43FC8852.8020304@mellanox.co.il> The init attributes are being updated with the actual SRQ values in the uverbs. Signed-off-by: Dotan Barak Index: last_stable/src/userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-21 17:00:25.000000000 +0200 +++ last_stable/src/userspace/libibverbs/include/infiniband/kern-abi.h 2006-02-22 16:19:34.000000000 +0200 @@ -48,7 +48,7 @@ * The minimum and maximum kernel ABI that we can handle. 
*/ #define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 5 +#define IB_USER_VERBS_MAX_ABI_VERSION 6 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -708,6 +708,8 @@ struct ibv_create_srq { struct ibv_create_srq_resp { __u32 srq_handle; + __u32 max_wr; + __u32 max_sge; }; struct ibv_modify_srq { Index: last_stable/src/userspace/libibverbs/src/cmd.c =================================================================== --- last_stable.orig/src/userspace/libibverbs/src/cmd.c 2006-02-21 17:00:25.000000000 +0200 +++ last_stable/src/userspace/libibverbs/src/cmd.c 2006-02-22 16:08:25.000000000 +0200 @@ -435,6 +435,11 @@ int ibv_cmd_create_srq(struct ibv_pd *pd srq->handle = resp->srq_handle; + if (abi_ver > 5) { + attr->attr.max_wr = resp->max_wr; + attr->attr.max_sge = resp->max_sge; + } + return 0; } From dotanb at mellanox.co.il Wed Feb 22 07:52:15 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 22 Feb 2006 17:52:15 +0200 Subject: [openib-general] 2/2 core: update init attribute in create_srq Message-ID: <43FC88AF.4080503@mellanox.co.il> The init attributes are being updated with the actual SRQ values in the core for the uverbs. 
Signed-off-by: Dotan Barak Index: last_stable/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- last_stable.orig/drivers/infiniband/core/uverbs_cmd.c 2006-02-21 17:04:10.000000000 +0200 +++ last_stable/drivers/infiniband/core/uverbs_cmd.c 2006-02-22 16:16:48.000000000 +0200 @@ -1864,6 +1864,8 @@ retry: goto err_destroy; resp.srq_handle = uobj->uobject.id; + resp.max_wr = attr.attr.max_wr; + resp.max_sge = attr.attr.max_sge; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { Index: last_stable/drivers/infiniband/include/rdma/ib_user_verbs.h =================================================================== --- last_stable.orig/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-21 17:04:07.000000000 +0200 +++ last_stable/drivers/infiniband/include/rdma/ib_user_verbs.h 2006-02-22 16:17:22.000000000 +0200 @@ -44,7 +44,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IB_USER_VERBS_ABI_VERSION 5 +#define IB_USER_VERBS_ABI_VERSION 6 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -643,6 +643,8 @@ struct ib_uverbs_create_srq { struct ib_uverbs_create_srq_resp { __u32 srq_handle; + __u32 max_wr; + __u32 max_sge; }; struct ib_uverbs_modify_srq { From hch at lst.de Wed Feb 22 08:19:01 2006 From: hch at lst.de (Christoph Hellwig) Date: Wed, 22 Feb 2006 17:19:01 +0100 Subject: [openib-general] [PATCH 1/6] [RFC] iscsi_iser header file In-Reply-To: References: Message-ID: <20060222161901.GA24303@lst.de> On Wed, Feb 22, 2006 at 04:25:49PM +0200, Or Gerlitz wrote: > + some of the defines here replicate those in drivers/scsi/iscsi_tcp.h so > merging them is possible > > + a cleanup we plan is to reduce the usage of the iser dbg/err/bug macros, > convert the remaining iser_bug calls into standard BUG() calls.
> > + the wire structures are iser_hdr (below) and various iscsi_hdr > (from scsi/iscsi_proto.h) iser_adaptor is misspelled ;-) Seriously, I think iser_device might be a better name for this. Please don't use volatile but an atomic_t or bitops for the session state field. From hch at lst.de Wed Feb 22 08:25:07 2006 From: hch at lst.de (Christoph Hellwig) Date: Wed, 22 Feb 2006 17:25:07 +0100 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: References: Message-ID: <20060222162507.GB24303@lst.de> > +/* Constant PDU lengths calculations */ > +#define ISER_HDR_LEN sizeof (struct iser_hdr) > +#define ISER_PDU_BHS_LENGTH sizeof (struct iscsi_hdr) These two macros are only used in ISER_TOTAL_HEADERS_LEN below, just kill them. > +#define USE_OFFSET(offset) (offset) > +#define USE_NO_OFFSET 0 > +#define USE_SIZE(size) (size) > +#define USE_ENTIRE_SIZE 0 Please kill these macros. > +/* iser_dto_add_regd_buff - increments the reference count for * > + * the registered buffer & adds it to the DTO object */ > +static void iser_dto_add_regd_buff(struct iser_dto *p_dto, > + struct iser_regd_buf *p_regd_buf, > + unsigned long use_offset, > + unsigned long use_size) > +{ > + int add_idx; > + > + atomic_inc(&p_regd_buf->ref_count); Please kill the p_ prefix for pointer types all over the code.
> +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, > + struct iser_data_buf *p_data, > + enum iser_data_dir iser_dir, > + enum dma_data_direction dma_dir) > +{ > + struct device *dma_device; > + dma_addr_t dma_addr; > + int dma_nents; > + > + p_iser_task->dir[iser_dir] = 1; > + dma_device = p_iser_task->conn->ib_conn->p_adaptor->device->dma_device; > + > + if (p_data->type == ISER_BUF_TYPE_SINGLE) { > + p_iser_task->data_len[iser_dir] = p_data->size; > + dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size, > + dma_dir); > + if (dma_mapping_error(dma_addr)) { > + iser_err("dma_map_single failed at %p\n", p_data->p_buf); > + return -EINVAL; > + } > + p_data->dma_addr = dma_addr; > + } else { I'd say kill the non-SG case. We're in the process of removing non-SG commands in the scsi midlayer, and I'm pretty sure they won't exist anymore by the time the iser code is merged. > +static int iser_post_receive_control(struct iscsi_iser_conn *p_iser_conn) > +{ > + struct iser_desc *rx_desc; > + struct iser_regd_buf *p_regd_hdr; > + struct iser_regd_buf *p_regd_data; > + struct iser_dto *p_recv_dto = NULL; > + struct iser_adaptor *p_iser_adaptor = p_iser_conn->ib_conn->p_adaptor; > + int rx_data_size, err = 0; > + > + rx_desc = kmem_cache_alloc(ig.desc_cache, > + GFP_KERNEL | __GFP_NOFAIL); __GFP_NOFAIL doesn't work for slab (kmem_cache_alloc/kmalloc/kzalloc/kcalloc) allocations. > +send_data_out_error: > + if (p_send_dto != NULL) > + iser_dto_buffs_release(p_send_dto); > + if (tx_desc != NULL) > + kmem_cache_free(ig.desc_cache, tx_desc); Could you please use the same goto-unwinding style we use elsewhere in the kernel? That is, one label before each unwind step, and jump directly to that label instead of adding tons of conditionals in the error path.
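A minimal sketch of the goto-unwinding style requested above. It is not taken from the iser patch: `struct demo_ctx`, `demo_setup`, `demo_teardown` and the `fail_at` parameter are hypothetical, and plain malloc/free stand in for the kernel allocators. Each label undoes exactly one earlier step, and a failure jumps to the label matching the last step that succeeded, so the error path needs no NULL checks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context holding two resources acquired in order. */
struct demo_ctx {
	char *hdr;	/* acquired first */
	char *data;	/* acquired second */
};

/*
 * fail_at simulates a failure at step 1, 2 or 3 (0 means no simulated
 * failure). Returns 0 on success and -1 on error; on error nothing
 * stays allocated.
 */
static int demo_setup(struct demo_ctx *ctx, int fail_at)
{
	assert(ctx != NULL);
	ctx->hdr = NULL;
	ctx->data = NULL;

	ctx->hdr = (fail_at == 1) ? NULL : malloc(64);
	if (!ctx->hdr)
		goto err_out;

	ctx->data = (fail_at == 2) ? NULL : malloc(4096);
	if (!ctx->data)
		goto err_free_hdr;

	if (fail_at == 3)	/* stands in for a later step failing */
		goto err_free_data;

	return 0;

	/* Labels appear in reverse order of acquisition. */
err_free_data:
	free(ctx->data);
	ctx->data = NULL;
err_free_hdr:
	free(ctx->hdr);
	ctx->hdr = NULL;
err_out:
	return -1;
}

/* Releases everything a successful demo_setup() acquired. */
static void demo_teardown(struct demo_ctx *ctx)
{
	free(ctx->data);
	free(ctx->hdr);
	ctx->data = NULL;
	ctx->hdr = NULL;
}
```

Because the labels fall through in reverse order of acquisition, adding a new step only needs one new label and one new goto.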
From hch at lst.de Wed Feb 22 08:29:03 2006 From: hch at lst.de (Christoph Hellwig) Date: Wed, 22 Feb 2006 17:29:03 +0100 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: References: Message-ID: <20060222162903.GC24303@lst.de> > + if (cmd_dir == ISER_DIR_OUT) { > + /* copy the unaligned sg the buffer which is used for RDMA */ > + struct scatterlist *p_sg = (struct scatterlist *)p_mem->p_buf; > + int i; > + char *p; > + > + for (p = mem, i = 0; i < p_mem->size; i++) { > + memcpy(p, > + page_address(p_sg[i].page) + p_sg[i].offset, > + p_sg[i].length); > + p += p_sg[i].length; Pages you are handed in an sg list don't have to be kernel mapped; you need to use kmap or kmap_atomic to access them. From hch at lst.de Wed Feb 22 08:30:49 2006 From: hch at lst.de (Christoph Hellwig) Date: Wed, 22 Feb 2006 17:30:49 +0100 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket In-Reply-To: References: Message-ID: <20060222163049.GD24303@lst.de> On Wed, Feb 22, 2006 at 04:37:24PM +0200, Or Gerlitz wrote: > + note that data is never moved on the socket via send/recv but > only by calls from iscsi_iser.c to iser_send_control/command/dataout > > + data originating/resulting in user space (eg login request/response) > is moved down/up by open iscsi using netlink So what do the iser sockets do? They look like noop stubs to me. From jackm at mellanox.co.il Wed Feb 22 08:36:05 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 22 Feb 2006 18:36:05 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MADtransactions with same TID Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BBEA2@mtlexch01.mtl.com> The issue is complex, and two-fold: A -- 1. We should PREVENT sending a duplicate, identical request MAD while the previous MAD has not yet timed out (but allow RMPP ACK/NACK packets, which have the identical TID/GID/class as the original request packet). 2.
Similarly, we should PREVENT sending a new duplicate RMPP MAD from the sender side (usually an RMPP response) while the previous RMPP session is still in progress. B -- We should ALLOW sending duplicate response MADs (or duplicate RMPP response sessions) having the same transaction ID, but going to different destinations. ---- Regarding A.2 and B: Normal (non-RMPP) responses do not have timeouts, whereas RMPP responses do have timeouts per segment (via the RMPP protocol). However, these timeouts are visible only after the call to ib_post_send_mad() (which is the natural place to put duplication detection). In the current OpenSM implementation, all response MADs are passed from user-space to kernel space with a timeout set to zero -- and this 0-timeout is passed to ib_post_send_request() by ib_umad_write. If an RMPP response is indicated, the timeout is changed in mad_rmpp.c, send_next_seg() just before calling ib_send_mad(). Thus, when the segment is sent and the send_completion is received, the mad transaction is transferred to the send wait-queue to await a response packet (since the timeout is non-zero at that point). In order to comply with all the restrictions above, we need to do the following: When SENDING: If RESPONSE bit of method is set: Need to check TID/GID/class of all responses in list to verify that this is not a duplicate. Otherwise: Need to check TID/class of all requests in list. NOTE: Currently, struct ib_mad_send_wr_private holds only the address handle pointer, NOT the address handle attributes. We need the AH attribute data to check GID, LID, and GRH. To extract this info we can either add it to the private struct (requiring changes in ib_create_send_mad, and affecting lots of code), or we can change the verb ib_query_ah() to be mandatory (it is optional in the IB Spec). When RECEIVING: If RESPONSE bit is set: Need to check TID/class against outstanding requests.
Otherwise: Need to check TID/GID/class against outstanding responses (RMPP). GID is important here, because the responder may have several RMPP sessions active with the same TID, but involving different destination hosts. Comments? (especially regarding either requiring ib_query_ah vs. impacting lots of existing code) Jack -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Monday, January 23, 2006 11:08 PM To: Michael S. Tsirkin Cc: Jack Morgenstein Subject: Re: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MADtransactions with same TID Michael S. Tsirkin wrote: > I think you are right. > Hmm, how come mad.c does request/response matching simply by calling > ib_find_send_mad which only gets a tid? > ib_find_send_mad checks against request MADs only - those with a timeout. We have a more complicated issue here. With a response MAD, we can have multiple TIDs that are the same, as long as they're going to different destinations (mgmt class or dlid/dgid). So, request MADs have a limitation, whereas response MADs don't? - Sean From caitlinb at broadcom.com Wed Feb 22 09:03:08 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 22 Feb 2006 09:03:08 -0800 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297203@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Wed, Feb 22, 2006 at 04:37:24PM +0200, Or Gerlitz wrote: >> + note that data is never moved on the socket via send/recv but >> only by calls from iscsi_iser.c to > iser_send_control/command/dataout >> >> + data originating/resulting in user space (eg login request/response) >> is moved down/up by open iscsi using netlink > > So what do the iser sockets do? They look like noop stubs to me. > Good question. I am guessing that they are exactly noop stubs, and the real point is to have a socket associated with an iSER RDMA connection. My question is why?
Unless the attempt is to allow upgrading an iSCSI stream connection to an iSER connection, I don't see why an iSER RDMA connection needs a proxy socket any more than any other RDMA connection does. I really don't see the benefit of having a "socket" that is not truly integrated with the host stack. What socket attributes are being sought? And how is it unique to iSER as opposed to RDMA in general? From kschoche at scl.ameslab.gov Wed Feb 22 09:27:41 2006 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Wed, 22 Feb 2006 11:27:41 -0600 Subject: [openib-general] create QP failure Message-ID: <43FC9F0D.9040304@scl.ameslab.gov> Yesterday we updated to the 2.6.15 kernel release from 2.6.14, as well as from libibverbs rc5 to rc7. Some code that previously worked now returns an error on a second call to ibv_create_qp(**). Here's a snippet: c->qp = ibv_create_qp(c->pd, &att); if(!c->qp) error(1, "%s: create QP", __func__); c->qp_ack = ibv_create_qp(c->pd, &att); if (!c->qp_ack) error(1, "%s: create QP ack", __func__); The second ibv_create_qp() call fails for some reason, even though it's being initialized with the same attributes as the first call/qp. I didn't see anything in particular that said this should break on updating to the latest rc's. Have I missed a major change which would cause this to break now? Thanks, - Kyle -- Kyle Schochenmaier kschoche at scl.ameslab.gov Research Assistant, Dr.
Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory From ftillier at silverstorm.com Wed Feb 22 09:53:34 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 22 Feb 2006 09:53:34 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <20060222082635.GE1902@greglaptop.hsd1.ca.comcast.net> References: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> <20060222082635.GE1902@greglaptop.hsd1.ca.comcast.net> Message-ID: <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> On 2/22/06, Greg Lindahl wrote: > > On Tue, Feb 21, 2006 at 11:40:53PM -0800, Fabian Tillier wrote: > > > You'd have to make the group 1X. Note that the group being 1X doesn't > > limit unicast traffic to 1X rates, since the rate for unicast traffic > > would be set based on the rate reported in the path records for the > > various endpoints. > > > > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > > for the broadcast group to end up with a 1X packet injection rate. > > So, basically, MVAPICH doesn't have code that does either the group > creation properly when there is a mixture of HCA bandwidths, or limit > the packet injection rate. And IPoIB could violate this rule depending > on how user programs use it, e.g. if I did a lot of broadcasting, I > could easily exceed 1X's bandwidth. > > So this is more than just a "fix OpenSM" issue. It's more of a "fix > the spec" issue, if I'm understanding it correctly. No, the spec is fine. This is a "fix the SW" issue. 
If OpenSM rejected join requests of nodes for which the MC group is unrealizable (that is, some setting of the requestor conflict with the existing group, such as the rate), such nodes would not be able to join the broadcast group and thus not have IPoIB connectivity. When the SA responds to the MC join request, the response includes the rate. The recipient of the response should create an address vector for the MC group that takes the rate into account, which would cause the hardware to honor the injection rate such as to not flood the group. I haven't looked at MVAPICH, so I can't tell you if what it does is correct. IPoIB does seem to do the right thing, though. - Fab From ftillier at silverstorm.com Wed Feb 22 09:56:25 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 22 Feb 2006 09:56:25 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <1140607947.28051.13869.camel@hal.voltaire.com> References: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> <1140607947.28051.13869.camel@hal.voltaire.com> Message-ID: <79ae2f320602220956x1ef1130aqdafa72b0554023cf@mail.gmail.com> On 22 Feb 2006 06:32:29 -0500, Hal Rosenstock wrote: > On Wed, 2006-02-22 at 02:40, Fabian Tillier wrote: > > On 2/21/06, Greg Lindahl wrote: > > > On Tue, Feb 21, 2006 at 08:17:02PM -0800, Fabian Tillier wrote: > > > > > > > The node joining or creating the multicast group doesn't need to > > > > specify the rate - the SA can figure out the rate to use based on the > > > > requestor (for creation), or validate that the requestor supports 
the > > > > existing group's rate (for joining). > > > > > > Um, but that gets back to my point: I want 1X, 4X SDR, and 4X DDR > > > nodes running IPoIB to share a multicast group. Are you saying this > > > can be done by making the group a 1X group? Or that it's impossible > > > to have such a group? Or that everyone would have to drop to 1X to > > > make such a group? > > > > You'd have to make the group 1X. Note that the group being 1X doesn't > > limit unicast traffic to 1X rates, since the rate for unicast traffic > > would be set based on the rate reported in the path records for the > > various endpoints. > > It does, however, limit all other (IB) multicast groups in that > partition to the same rate as the IPoIB broadcast group. That may be the > correct choice of the admin (and 1x nodes would be refused). I agree that 1X nodes are likely erroneous, and it's fine (and probably preferable) for an admin to cause these to be refused. I started this thread with the intent of figuring out how to distinguish refusal from some other failures, so that IPoIB could log the appropriate event and a system administrator could identify that there was a bad link in the fabric. The IB spec doesn't provide a mechanism to do that, and Hal and I will work to change that. - Fab From mshefty at ichips.intel.com Wed Feb 22 10:07:46 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 22 Feb 2006 10:07:46 -0800 Subject: [openib-general] RE: [PATCH 3 of 3] mad: large RMPP support, Round 2 In-Reply-To: <200602220937.56443.jackm@mellanox.co.il> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4BCC@mtlexch01.mtl.com> <43FB5571.1060000@ichips.intel.com> <200602220937.56443.jackm@mellanox.co.il> Message-ID: <43FCA872.50605@ichips.intel.com> Jack Morgenstein wrote: > Either way is OK by me. Note that ib_create_send_mad does allocate the "base" > MAD with kzalloc. It's just the RMPP segments that are allocated with kmalloc > and are not initially cleared.
I have a patch that updates the comments. The comment states that the code clears the MAD headers, formats any RMPP header, and clears any padding. > I just thought that by returning the segment size in the procedure call, we > preserve the option to easily support multiple segment sizes within a single > RMPP message. However, if you think that we will never need such an option, > I don't object to exposing the segment size in the ib_mad_send_buf structure. Hmm... this makes sense. What I have done currently is add hdr_len, seg_count, and seg_size fields to ib_mad_send_buf. The original MAD code didn't require that the user allocate a MAD using a helper function. That has since been changed by adding ib_mad_send_buf, and the code wasn't ever cleaned up as a result of this. Since users must call ib_create_send_mad(), there are a few places in the code where we can be more efficient when handling RMPP, and eliminate some duplicated RMPP information. For example, by saving the hdr_len and seg_count, we avoid having to recalculate these later, and we can format the RMPP paylen to the correct value, rather than the more confusing value that it's set to now. Reporting the segment size when retrieving a segment may make sense. I'm not sure if we'll ever support variable-sized segments, since doing so complicates the RMPP code. - Sean From mamidala at cse.ohio-state.edu Wed Feb 22 11:47:01 2006 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Wed, 22 Feb 2006 14:47:01 -0500 (EST) Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> Message-ID: I have a question related to the mixture of HCA bandwidths in the fabric. For an upper layer, like MVAPICH, "negotiating" for a rate so that all the ports are "involved" can be quite expensive, especially if the code falls in the critical path.
Can any additional support be provided by the underlying SA interface, so that the upper protocol layers can do the job in minimal time? This kind of support can be used not only for MPI but for other stacks as well. Thanks, Amith On Wed, 22 Feb 2006, Fabian Tillier wrote: > On 2/22/06, Greg Lindahl wrote: > > > > On Tue, Feb 21, 2006 at 11:40:53PM -0800, Fabian Tillier wrote: > > > > > You'd have to make the group 1X. Note that the group being 1X doesn't > > > limit unicast traffic to 1X rates, since the rate for unicast traffic > > > would be set based on the rate reported in the path records for the > > > various endpoints. > > > > > > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > > > for the broadcast group to end up with a 1X packet injection rate. > > > > So, basically, MVAPICH doesn't have code that does either the group > > creation properly when there is a mixture of HCA bandwidths, or limit > > the packet injection rate. And IPoIB could violate this rule depending > > on how user programs use it, e.g. if I did a lot of broadcasting, I > > could easily exceed 1X's bandwidth. > > > > So this is more than just a "fix OpenSM" issue. It's more of a "fix > > the spec" issue, if I'm understanding it correctly. > > No, the spec is fine. This is a "fix the SW" issue. If OpenSM > rejected join requests of nodes for which the MC group is unrealizable > (that is, some setting of the requestor conflict with the existing > group, such as the rate), such nodes would not be able to join the > broadcast group and thus not have IPoIB connectivity. > > When the SA responds to the MC join request, the response includes the > rate. The recipient of the response should create an address vector > for the MC group that takes the rate into account, which would cause > the hardware to honor the injection rate such as to not flood the > group. I haven't looked at MVAPICH, so I can't tell you if what it > does is correct.
IPoIB does seem to do the right thing, though. > > - Fab > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ftillier at silverstorm.com Wed Feb 22 11:54:00 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 22 Feb 2006 11:54:00 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: References: <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> Message-ID: <79ae2f320602221154h1e4a4c3dp2e32d40f15a48291@mail.gmail.com> On 2/22/06, amith rajith mamidala wrote: > > I have a question related to the mixture of HCA bandwidths > in the fabric. For an upper layer, like MVAPICH, "negotiating" > for a rate so that all the ports are "involved" can be quite > expensive, especially if the code falls in the critical path. > Can any additional support be provided by the underlying > SA interface, so that the upper protocol layers can do the job > in a minimum time. This kind of support can be used not only for > MPI but for other stacks as well, All the rate information can be figured out at process launch time. If at launch time each MPI process used the SA to query for path information to connect to all the other processes, then the application could also look at the returned set of paths and figure out what the slowest rate in the fabric is and use that for multicast traffic. Note that point-to-point traffic, whether reliable connected or unreliable datagram, would use the rate in the path record returned by the SA. Only multicast traffic must be aware of the rate limitations of the whole job. Hopefully that helps.
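A minimal sketch of the launch-time approach described above, assuming each process has already copied the rate field out of every path record the SA returned. The function names and the input array are hypothetical and are not part of any OpenIB API. The numeric codes are the IB static rate encodings as I understand them (2 = 2.5 Gb/s, 3 = 10, 4 = 30, 5 = 5, 6 = 20, 7 = 40, 8 = 60, 9 = 80, 10 = 120 Gb/s); because the codes are not in speed order, they have to be mapped to a speed before being compared.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Map an IB rate code to a speed in tenths of Gb/s.
 * Returns 0 for a code this sketch does not recognize.
 * The code-to-speed table is an assumption; check it against the spec.
 */
static unsigned rate_code_to_speed(uint8_t code)
{
    switch (code) {
    case 2:  return 25;    /* 1X SDR,  2.5 Gb/s */
    case 3:  return 100;   /* 4X SDR,  10 Gb/s  */
    case 4:  return 300;   /* 12X SDR, 30 Gb/s  */
    case 5:  return 50;    /* 1X DDR,  5 Gb/s   */
    case 6:  return 200;   /* 4X DDR,  20 Gb/s  */
    case 7:  return 400;
    case 8:  return 600;
    case 9:  return 800;
    case 10: return 1200;
    default: return 0;
    }
}

/*
 * Return the rate code of the slowest path in codes[0..n-1],
 * or 0 if the array holds no recognized code.
 * Hypothetical helper: the caller would use the result as the
 * rate for multicast traffic only.
 */
static uint8_t slowest_rate_code(const uint8_t *codes, size_t n)
{
    uint8_t best_code = 0;
    unsigned best_speed = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned speed = rate_code_to_speed(codes[i]);

        if (speed == 0)
            continue;          /* skip unknown codes */
        if (best_code == 0 || speed < best_speed) {
            best_code = codes[i];
            best_speed = speed;
        }
    }
    return best_code;
}
```

The result would only be applied to multicast sends; unicast traffic keeps the per-path rate from its own path record.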
- Fab From halr at voltaire.com Wed Feb 22 13:56:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Feb 2006 16:56:55 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602221154h1e4a4c3dp2e32d40f15a48291@mail.gmail.com> References: <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> <79ae2f320602221154h1e4a4c3dp2e32d40f15a48291@mail.gmail.com> Message-ID: <1140645410.28051.19222.camel@hal.voltaire.com> On Wed, 2006-02-22 at 14:54, Fabian Tillier wrote: > On 2/22/06, amith rajith mamidala wrote: > > > > I have a question related to the mixture of HCA bandwidths > > in the fabric. For an upper layer, like MVAPICH, "negotiating" > > for a rate so that all the ports are "involved" can be quite > > expensive, especially if the code falls in the critical path. > > Can any additional support be provided by the underlying > > SA interface, so that the upper protocol layers can do the job > > in a minimum time. This kind of support can be used not only for > > MPI but for other stacks as well, > > All the rate information can be figured out at process launch time. > If at launch time each MPI process used the SA to query for path > information to connect to all the other processes, then the > application could also look at the returned set of paths and figure > out what the slowest rate in the fabric is and use that for multicast > traffic. Note that point-to-point traffic, whether reliable > connected or unreliable datagram, would use the rate in the path > record returned by the SA. Only multicast traffic must be aware of > the rate limitations of the whole job. You can also get rate information on multicast groups from the SA. -- Hal > Hopefully that helps.
> > - Fab > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Feb 22 14:01:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Feb 2006 17:01:43 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> References: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <79ae2f320602210908j1256615fl480a6af71dd0ebc5@mail.gmail.com> <1140542552.28051.6480.camel@hal.voltaire.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> <20060222082635.GE1902@greglaptop.hsd1.ca.comcast.net> <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> Message-ID: <1140645573.28051.19245.camel@hal.voltaire.com> On Wed, 2006-02-22 at 12:53, Fabian Tillier wrote: > On 2/22/06, Greg Lindahl wrote: > > > > On Tue, Feb 21, 2006 at 11:40:53PM -0800, Fabian Tillier wrote: > > > > > You'd have to make the group 1X. Note that the group being 1X doesn't > > > limit unicast traffic to 1X rates, since the rate for unicast traffic > > > would be set based on the rate reported in the path records for the > > > various endpoints. > > > > > > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > > > for the broadcast group to end up with a 1X packet injection rate. > > > > So, basically, MVAPICH doesn't have code that does either the group > > creation properly when there is a mixture of HCA bandwidths, or limit > > the packet injection rate. 
And IPoIB could violate this rule depending > > on how user programs use it, e.g. if I did a lot of broadcasting, I > > could easily exceed 1X's bandwidth. > > > > So this is more than just a "fix OpenSM" issue. It's more of a "fix > > the spec" issue, if I'm understanding it correctly. > > No, the spec is fine. This is a "fix the SW" issue. If OpenSM > rejected join requests of nodes for which the MC group is unrealizable > (that is, some setting of the requestor conflict with the existing > group, such as the rate), such nodes would not be able to join the > broadcast group and thus not have IPoIB connectivity. > > When the SA responds to the MC join request, the response includes the > rate. The recipient of the response should create an address vector > for the MC group that takes the rate into account, which would cause > the hardware to honor the injection rate such as to not flood the > group. I haven't looked at MVAPICH, so I can't tell you if what it > does is correct. IPoIB does seem to do the right thing, though. I thought there were 2 issues here: 1. OpenSM not checking the realizability of the join request and 2. 
The spec issue with discerning the reason for SA refusal of the join request -- Hal > > - Fab From ftillier at silverstorm.com Wed Feb 22 14:16:42 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 22 Feb 2006 14:16:42 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <1140645573.28051.19245.camel@hal.voltaire.com> References: <79ae2f320602210815o62eff7a5qc00744c4b755c8d7@mail.gmail.com> <20060222011745.GL2853@greglaptop.internal.keyresearch.com> <20060222022817.GA5391@greglaptop.internal.keyresearch.com> <79ae2f320602212017i7feafbb0se3d75e196ea734f0@mail.gmail.com> <20060222055254.GA1554@greglaptop.hsd1.ca.comcast.net> <79ae2f320602212340j35663f90p39d63ff71de6bb92@mail.gmail.com> <20060222082635.GE1902@greglaptop.hsd1.ca.comcast.net> <79ae2f320602220953g1b06ebc1v995a9bcb4eb73e21@mail.gmail.com> <1140645573.28051.19245.camel@hal.voltaire.com> Message-ID: <79ae2f320602221416l6e280475g6d3b4006c529ded4@mail.gmail.com> On 22 Feb 2006 17:01:43 -0500, Hal Rosenstock wrote: > On Wed, 2006-02-22 at 12:53, Fabian Tillier wrote: > > On 2/22/06, Greg Lindahl wrote: > > > > > > On Tue, Feb 21, 2006 at 11:40:53PM -0800, Fabian Tillier wrote: > > > > > > > You'd have to make the group 1X. Note that the group being 1X doesn't > > > > limit unicast traffic to 1X rates, since the rate for unicast traffic > > > > would be set based on the rate reported in the path records for the > > > > various endpoints. > > > > > > > > So 4X SDR and 4X DDR nodes would have to set their inter-packet delay > > > > for the broadcast group to end up with a 1X packet injection rate.
> > > > > > So, basically, MVAPICH doesn't have code that does either the group > > > creation properly when there is a mixture of HCA bandwidths, or limit > > > the packet injection rate. And IPoIB could violate this rule depending > > > on how user programs use it, e.g. if I did a lot of broadcasting, I > > > could easily exceed 1X's bandwidth. > > > > > > So this is more than just a "fix OpenSM" issue. It's more of a "fix > > > the spec" issue, if I'm understanding it correctly. > > > > No, the spec is fine. This is a "fix the SW" issue. If OpenSM > > rejected join requests of nodes for which the MC group is unrealizable > > (that is, some setting of the requestor conflict with the existing > > group, such as the rate), such nodes would not be able to join the > > broadcast group and thus not have IPoIB connectivity. > > I thought there were 2 issues here: > 1. OpenSM not checking the realizability of the join request > and > 2. The spec issue with discerning the reason for SA refusal of the joing > request That's correct, but neither of the above indicates a fatal flaw in multicast handling in the IB specification. Though I guess I overstated things when I said "the spec is fine". The spec could be better, but it's not broken. How's that? ;) - Fab From rdreier at cisco.com Wed Feb 22 14:48:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Feb 2006 14:48:48 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E301075D8C@mtlexch01.mtl.com> (Gil Bloch's message of "Wed, 22 Feb 2006 13:08:49 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E301075D8C@mtlexch01.mtl.com> Message-ID: Gil> Roland, I believe we should add support for a resize WQ Gil> command (as a part of modify QP) to enable changing the WQ Gil> size. On a very large scale cluster, with many operating Gil> QPs, the work queue memory consumption might be Gil> expansive. 
Thus the MPI implementation should tradeoff for Gil> pipelining requests vs. WQ memory consumption. The resize WQ Gil> will allow on-demand adaptive WQ setting instead of static Gil> allocation of the memory resource, which I believe can Gil> increase performance and save memory at the same time. Does Mellanox HW support resizing a WQ after a QP is created? If so would you be willing to contribute an implementation? Thanks, Roland From rdreier at cisco.com Wed Feb 22 14:54:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Feb 2006 14:54:42 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FC9F0D.9040304@scl.ameslab.gov> (Kyle Schochenmaier's message of "Wed, 22 Feb 2006 11:27:41 -0600") References: <43FC9F0D.9040304@scl.ameslab.gov> Message-ID: Kyle> The second ibv_create_qp() call fails for some reason, even Kyle> though it's being initialized with the same attributes as Kyle> the first call/qp. I didnt see anything in particular that Kyle> said this should break on updating to the latest rc's. Have Kyle> I missed a major change which would cause this to break now? It should still work. There was a change to the interface, so that the kernel now returns the real QP capacities so libibverbs can give you the true QP attributes that your QP was created with. This probably introduced a bug somewhere. Can you trim down your code to a simple testcase that shows this failure? Failing that, can you capture the output of running your program with "strace -ewrite -ewrite=all" up to the point where the second QP create fails? 
Thanks, Roland From caitlinb at broadcom.com Wed Feb 22 15:12:01 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 22 Feb 2006 15:12:01 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond Message-ID: <54AD0F12E08D1541B826BE97C98F99F129729D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Gil> Roland, I believe we should add support for a resize WQ > Gil> command (as a part of modify QP) to enable changing the WQ > Gil> size. On a very large scale cluster, with many operating > Gil> QPs, the work queue memory consumption might be > Gil> expansive. Thus the MPI implementation should tradeoff for > Gil> pipelining requests vs. WQ memory consumption. The resize WQ > Gil> will allow on-demand adaptive WQ setting instead of static > Gil> allocation of the memory resource, which I believe can > Gil> increase performance and save memory at the same time. > > Does Mellanox HW support resizing a WQ after a QP is created? > If so would you be willing to contribute an implementation? > This is an API question, not an implementation question. We can reasonably anticipate that a) some devices could actually implement a work queue resize and in doing so free on-chip resources, and b) for some devices the only resources that would be freed by changing work queue sizes would be on the host (and that the synchronization required would typically not justify the benefit of releasing space for work requests alone). The resources in question (the work queues) are inherently device specific. An implementation that resizes Brand X work queues is of no real benefit to any other device. We need to distinguish between the two rationales for changing work queue sizes: reducing resource usage, and seeking hardware assist in enforcing ULP constraints. Freeing resources for real is trickier, and could easily be something best done at an opportunistic time (such as the next time that the work queue wraps around to its base).
While any resizing that is supposed to result in a throttling effect should take effect (at least logically) immediately. So the real question is whether there are devices that need resize support to truly allow adjustments to on-chip resources, and secondly whether applications should be given the expectation that resizing will a) release on-chip resources such that they can be used for another QP and b) will actually throttle applications if the new limits are exceeded. From ftillier at silverstorm.com Wed Feb 22 15:13:33 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 22 Feb 2006 15:13:33 -0800 Subject: [openib-general] create QP failure In-Reply-To: References: <43FC9F0D.9040304@scl.ameslab.gov> Message-ID: <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> On 2/22/06, Roland Dreier wrote: > Kyle> The second ibv_create_qp() call fails for some reason, even > Kyle> though it's being initialized with the same attributes as > Kyle> the first call/qp. I didnt see anything in particular that > Kyle> said this should break on updating to the latest rc's. Have > Kyle> I missed a major change which would cause this to break now? > > It should still work. There was a change to the interface, so that > the kernel now returns the real QP capacities so libibverbs can give > you the true QP attributes that your QP was created with. This > probably introduced a bug somewhere. Roland, is the attribute structure now being used as an output parameter when it wasn't before? If so, what happens if the output from one create_qp call gets passed as input into another? And Kyle, did the application reset the attributes before the second call, or just pass the output of the first call as input to the second? 
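To make the in/out question concrete, here is a self-contained sketch of the pitfall. It does not use libibverbs and is not a diagnosis of Kyle's failure: `fake_qp_cap`, `fake_create_qp`, the device limit, and the round-up behavior are all hypothetical stand-ins for a create call that writes the granted capacities back into the caller's attribute struct.

```c
#include <stdint.h>

/* Hypothetical stand-in for the capacity part of a QP attribute struct. */
struct fake_qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

#define FAKE_MAX_WR 1024u   /* hypothetical device limit */

/* Smallest power of two that is >= v (v is assumed to be small). */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Hypothetical create call: rounds each request up (one spare entry,
 * then the next power of two) and, on success, overwrites *cap with
 * what was actually granted. Returns 0 on success, -1 if the rounded
 * size exceeds the device limit; *cap is left untouched on failure.
 */
static int fake_create_qp(struct fake_qp_cap *cap)
{
    uint32_t send = round_up_pow2(cap->max_send_wr + 1);
    uint32_t recv = round_up_pow2(cap->max_recv_wr + 1);

    if (send > FAKE_MAX_WR || recv > FAKE_MAX_WR)
        return -1;
    cap->max_send_wr = send;
    cap->max_recv_wr = recv;
    return 0;
}

/*
 * Create two QPs from the same wanted sizes. With reuse_output != 0 the
 * struct written back by the first call is fed into the second call;
 * otherwise it is reset from the caller's template first.
 * Returns the status of the second create (or -1 if the first fails).
 */
static int create_two_qps(const struct fake_qp_cap *wanted, int reuse_output)
{
    struct fake_qp_cap cap = *wanted;

    if (fake_create_qp(&cap) != 0)
        return -1;
    if (!reuse_output)
        cap = *wanted;      /* reset the in/out struct before reuse */
    return fake_create_qp(&cap);
}
```

Under these assumptions, a request for 600 entries is granted as 1024; feeding 1024 back in asks for more than the limit and fails, while resetting the struct from the original template succeeds both times.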
- Fab From bos at pathscale.com Wed Feb 22 15:35:26 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 22 Feb 2006 15:35:26 -0800 Subject: [openib-general] Towards a 1.0 release of OpenIB In-Reply-To: <000001c63801$0b097610$6aa1070a@amr.corp.intel.com> References: <000001c63801$0b097610$6aa1070a@amr.corp.intel.com> Message-ID: <1140651326.9011.9.camel@localhost.localdomain> On Wed, 2006-02-22 at 14:40 -0800, Bob Woodruff wrote: > The easier it is to install the release, the less > support and questions we will have to manage. I would say that > if a typical IT guy cannot easily install and get the stuff > working, then it is probably too hard. Fair enough. > I'd rather see everyone run different tests. That way there is more > coverage. It's fine with me if people run different tests; I'd like as many people as possible to actually be *able* to run the same ones, though, since this makes reproducing a problem that someone else finds easier. > Most of my customers need everything to be installed > from a binary RPM. They typically do not build their own kernels. What distros do they run? > At the workshop, Doug said he would build source and binary RPMs > for EL4 from the openib release branch so I think that one will be > covered. I can build RPMs for FC4 and SUSE10. From robert.j.woodruff at intel.com Wed Feb 22 15:41:43 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 22 Feb 2006 15:41:43 -0800 Subject: [openib-general] Towards a 1.0 release of OpenIB Message-ID: <1AC79F16F5C5284499BB9591B33D6F0006F66A60@orsmsx408> Bryan wrote, > Most of my customers need everything to be installed > from a binary RPM. They typically do not build their own kernels. What distros do they run? EL4, but that is because there are RPMs available for openib for EL4. Some would also like SLES. 
woody From bill.boas at gmail.com Wed Feb 22 17:41:54 2006 From: bill.boas at gmail.com (Bill Boas) Date: Wed, 22 Feb 2006 17:41:54 -0800 Subject: [openib-general] =?windows-1252?q?Subject=3A_Don=92t_miss_the_Op?= =?windows-1252?q?en_Fabrics_Symposium_on_March_6?= Message-ID: <19a929370602221741o75a28004q9a04678f52719846@mail.gmail.com> *Subject: Don't miss the Open Fabrics Symposium on March 6 * * * Join the OpenIB Alliance and the InfiniBandSM Trade Association for the Open Fabrics Symposium on *Monday, March 6, 2006*. In this day long event, industry leaders will share with you the current state, as well as the vision of data center fabric architectures for computing and storage, including: - How *RDMA-based fabric solutions deliver quantum leaps in performance* and flexibility for your data center - How to *innovate with fabric technologies* that enable new levels of efficiency for clustered computing and storage applications - How to *align the capabilities* and features of your data center products to the needs of your customers and end-users - How open source RDMA software *accelerates time-to-solution* of mission-critical applications How 10Gb/s+ fabric technologies deployed in data centers *save money*, even compared to their existing 1Gb/s infrastructure Register and learn more at http://www.acteva.com/booking.cfm?bevaid=104352. You will pay only $175 until February 28. Also on hand will be some of the top vertical industry leaders from financial, oil and manufacturing companies with whom you can discuss the impact of fabric technology on the successful deployment of grid computing, virtualization and I/O consolidation and confirm for your organization that innovation achieved in HPC matches mainstream needs for a successful enterprise. 
*Who Should Attend* CTOs, CIOs, technologists, product managers, researchers, designers, engineers, and software developers *When and Where* The Open Fabrics Symposium will be held March 6, from 8am-5:30pm, with a reception to follow. It is in conjunction with the Intel(r) Developer Forum at Moscone West in San Francisco. The cost of the Symposium is only $175 until February 28. Register and learn more at http://www.acteva.com/booking.cfm?bevaid=104352. View the agenda at http://openib.org/conference/spring2006openfabrics/ ************************************************************************ *IDF Discount * As a Symposium attendee you will receive an excellent discounted rate of only $700 to the Intel Developer Forum – for a full conference pass – a savings of $995. IDF is the perfect place to learn about future Intel initiatives and collaborate with your peers. Intel is also offering a one day pass for $295 to IDF for all Symposium attendees. To take advantage of this opportunity now and register for IDF, please visit *https://www21.cplan.com/pls/pg_intel/c125_reg_entry.idfa_spring_sys. * Use the following codes: For a full conference pass: *TWNOIB* For a one day pass to IDF: *ODROIB*** ************************************************************************ Learn more about the* *OpenIB Alliance at http://www.openib.org/. Learn more about the InfiniBandSM Trade Association at http://www.infinibandta.org. Best regards, Jim Ryan Chairman, OpenIB Alliance jim.ryan at intel.com Tom Bradicich Co-Chairman, InfiniBand Trade Association bradic at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jackm at mellanox.co.il Wed Feb 22 22:51:04 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 23 Feb 2006 08:51:04 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MADtransactions with same TID In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BBEA2@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BBEA2@mtlexch01.mtl.com> Message-ID: <200602230851.04308.jackm@mellanox.co.il> On Wednesday 22 February 2006 18:36, Jack Morgenstein wrote: > The issue is complex, and two-fold: > > A > -- > 1. We should PREVENT sending a new duplicate identical request MADs > while the previous MAD has not yet timed out (but allow RMPP ACK/NACK > packets, which have the identical TID/GID/class as the original request > packet). > > 2. Similarly, we should PREVENT sending a new duplicate RMPP mad from > sender side (usually an RMPP response) while the previous RMPP session > is still in progress. > > B > -- > We should ALLOW sending duplicate response MADs (or duplicate RMPP > response sessions) having the same transaction ID, but going to > different destinations. > > ---- > Regarding A.2 and B: Normal (non-RMPP) responses do not have timeouts, > whereas RMPP responses do have timeouts per segment (via the RMPP > protocol). > However, these timeouts are visible only after the call to > ib_post_send_mad() (which is the natural place to put duplication > detection). > > In the current OpenSM implementation, all response MADs are passed from > user-space to kernel space with a timeout set to zero -- and this > 0-timeout is passed to ib_post_send_request() by ib_umad_write. > > If an RMPP response is indicated, the timeout is changed in mad_rmpp.c, > send_next_seg() just before calling ib_send_mad(). Thus, when the > segment is sent and the send_completion is received, the mad transaction > is transferred to the send wait-queue to await a response packet (since > the timeout is non-zero at that point). 
The reason for this discussion on timeouts is for issue A.1. If we only do the duplication check for MADs with timeouts (i.e., MADs expecting a response), we will miss checking RMPP responses (which are sent with 0-timeout, as they should be -- all the RMPP complexity is, and should be, hidden from the sender). If, however, we add the duplication check for MADs with timeout=0, we'll check duplicates (inappropriately) for ALL MADs. This, specifically, will cause problems for the RMPP ACK/NACK messages. The correct condition for checking when sending a MAD is therefore: If EITHER the timeout specified in the ib_mad_send_buf struct is > 0; OR the packet has RMPP active, but is only a data packet (not a control packet), so we will check for RMPP responses. -- Jack From sean.hefty at intel.com Wed Feb 22 23:41:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 22 Feb 2006 23:41:01 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BBEA2@mtlexch01.mtl.com> Message-ID: >The issue is complex, and two-fold: I still need to consider this in more detail to see if there isn't some simpler solution that we're overlooking. >A >-- >1. We should PREVENT sending a new duplicate identical request MADs >while the previous MAD has not yet timed out (but allow RMPP ACK/NACK >packets, which have the identical TID/GID/class as the original request >packet). I'm uncertain about making this a requirement for kernel components, which we should be able to trust to behave correctly. Also of concern is that the TID may be 0 for management classes that do not make use of it. It is also permissible to have multiple outstanding MADs containing the same TID/GID/class. For example, a series of RMPP segments would have this, as would SNMP tunneling. >2.
Similarly, we should PREVENT sending a new duplicate RMPP mad from >sender side (usually an RMPP response) while the previous RMPP session >is still in progress. IMO, as long as nothing catastrophic happens on the responder side, we may be fine here. Duplicate responses should only come from seeing duplicate requests, so I would place the burden on the sender to be fixed. >When SENDING: > If RESPONSE bit of method is set: > Need to check TID/GID/class of all responses in list to >verify > that this is not a duplicate. See above -- I believe that this check would disallow valid transfers. > Otherwise: > Need to check TID/class of all requests in list. > > NOTE: Currently, struct ib_mad_send_wr_private holds only the >address > handle pointer, NOT the address handle attributes. We >need the > AH attribute data to check GID, LID, and grh. To >extract this > Info we can either add it to the private struct This data could also be added to struct ib_ah. >When RECEIVING: > If RESPONSE bit is set: > Need to check TID/class against outstanding requests. > Otherwise: > Need to check TID/GID/class against outstanding >responses (RMPP) > GID is important here, because responder may have >several > RMPP sessions active with same TID, but involving >different > Destination hosts. What specific error do you see in the receive path? Responses should match with requests based on TID alone, since we control setting the TID. I can see where a duplicate request may be received for a response that is currently in transfer, but that seems like a narrow window. The duplicate request could just as easily come before or after the response is sent, which would need to be handled by the ULP anyway. I don't see that this optimization is worth it. 
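For reference, the send-side condition Jack proposed earlier in this thread (run the duplicate check only when the timeout is non-zero, or when the MAD is an RMPP data packet rather than a control packet) could be sketched as below. The struct, the enum, and the function name are hypothetical and are not the ib_mad data structures; the type values only mirror the RMPP DATA/ACK/STOP/ABORT ordering as I understand it. This restates the proposed condition and does not settle whether the check is worth having.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RMPP packet types for this sketch only. */
enum sketch_rmpp_type {
    SKETCH_RMPP_NONE  = 0,
    SKETCH_RMPP_DATA  = 1,
    SKETCH_RMPP_ACK   = 2,
    SKETCH_RMPP_STOP  = 3,
    SKETCH_RMPP_ABORT = 4
};

/* Hypothetical view of the fields the condition looks at. */
struct sketch_send_mad {
    uint32_t timeout_ms;              /* 0 means no response is awaited */
    bool rmpp_active;                 /* RMPP active flag is set */
    enum sketch_rmpp_type rmpp_type;  /* meaningful only if rmpp_active */
};

/*
 * Proposed condition: check for duplicates if the sender expects a
 * response (timeout > 0), or if this is an RMPP data packet, so that
 * RMPP responses sent with a zero timeout are still covered. RMPP
 * control packets (ACK, STOP, ABORT) with a zero timeout are exempt.
 */
static bool needs_duplicate_check(const struct sketch_send_mad *mad)
{
    if (mad->timeout_ms > 0)
        return true;
    return mad->rmpp_active && mad->rmpp_type == SKETCH_RMPP_DATA;
}
```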
- Sean From yael at mellanox.co.il Thu Feb 23 00:13:21 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 23 Feb 2006 10:13:21 +0200 Subject: [openib-general] Re:[PATCH] OpenSM - fix multicast flow in osmtest Message-ID: <5zmzgijxwe.fsf@mtl066.yok.mtl.com> Hi Hal, I have done some work on osmtest/osmt_multicast.c. There was a problem: in the middle of the test and at the end, there were checks that the only multicast groups that exist are the ones discovered at the beginning of the test. This isn't always true, since multicast groups can be created by other applications during the run. What the check should really be is that all the multicast groups that were created during the test are deleted. I've added a map of all the MLIDs created in the test, and added the above check. Also, I re-enabled some code in the test that was under "#if 0": the checks of creation with an illegal rate and MTU. Also in the patch is a change to osm_dump_mc_record in osm_helper.c, to dump all the records of mc (previously not all were printed).
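The bookkeeping described above can be sketched in a few lines. This is only an illustration, assuming a small fixed-size set instead of the complib map the patch actually uses; `mlid_set` and its helpers are hypothetical names. The idea is to record every MLID the test creates, drop it when the test deletes the group, and treat anything left at the end as a failure, regardless of groups created by other applications.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MLID_SET_MAX 64   /* arbitrary capacity for this sketch */

/* Hypothetical set of MLIDs created by the test itself. */
struct mlid_set {
    uint16_t mlid[MLID_SET_MAX];
    size_t count;
};

static void mlid_set_init(struct mlid_set *s)
{
    s->count = 0;
}

/* Record an MLID. Returns false only if the set is full. */
static bool mlid_set_add(struct mlid_set *s, uint16_t mlid)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->mlid[i] == mlid)
            return true;            /* already tracked */
    if (s->count == MLID_SET_MAX)
        return false;
    s->mlid[s->count++] = mlid;
    return true;
}

/* Forget an MLID after its group was deleted. Returns false if unknown. */
static bool mlid_set_remove(struct mlid_set *s, uint16_t mlid)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->mlid[i] == mlid) {
            s->mlid[i] = s->mlid[--s->count];   /* fill the hole */
            return true;
        }
    }
    return false;
}

/* Number of groups the test created but has not deleted yet. */
static size_t mlid_set_leftover(const struct mlid_set *s)
{
    return s->count;
}
```

At the end of the flow the test would pass only if `mlid_set_leftover()` returns 0, which does not depend on how many unrelated groups exist in the SA database.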
Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/osmt_multicast.c =================================================================== --- osmtest/osmt_multicast.c (revision 5457) +++ osmtest/osmt_multicast.c (working copy) @@ -45,12 +45,15 @@ /* next error code: 16A */ +#ifndef __WIN__ #include +#endif #include #include #include #include #include +#include #include "osmtest.h" static @@ -88,6 +91,7 @@ osmt_query_mcast( IN osmtest_t * const p cl_list_iterator_t p_mgids_res; cl_status_t cl_status; cl_map_item_t *p_item,*p_next_item; + osmtest_mgrp_t *p_mgrp; OSM_LOG_ENTER( &p_osmt->log, osmt_query_mcast ); /* @@ -114,7 +118,6 @@ osmt_query_mcast( IN osmtest_t * const p req.pfn_query_cb = osmtest_query_res_cb; req.p_query_input = &user; req.sm_key = 0; - osmtest_mgrp_t *p_mgrp; status = osmv_query_sa( p_osmt->h_bind, &req ); @@ -422,6 +425,8 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_node_record_t *p_rec; uint32_t num_recs = 0,i; uint8_t mtu_phys = 0,rate_phys = 0; + cl_map_t test_created_mlids; /* List of all mlids created in this test */ + ib_member_rec_t* p_recvd_rec; static ib_gid_t good_mgid = { { @@ -462,6 +467,14 @@ osmt_run_mcast_flow( IN osmtest_t * cons 0xff, 0xff, 0xff, 0xff, /* 32 bit IPv4 broadcast address */ }, }; + static ib_gid_t osm_link_local_mgid = { + { + 0xFF, 0x02, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x01 + }, + }; OSM_LOG_ENTER( &p_osmt->log, osmt_run_mcast_flow ); @@ -480,6 +493,10 @@ osmt_run_mcast_flow( IN osmtest_t * cons } + /* Initialize the test_created_mgrps map */ + cl_map_construct(&test_created_mlids); + cl_map_init(&test_created_mlids, 1000); + p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; osmt_init_mc_query_rec(p_osmt, &mc_req_rec); @@ -726,6 +743,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -772,6 +790,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons 
ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -825,6 +844,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -875,6 +895,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -926,12 +947,10 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } -#if 0 - /* CURRENTLY NOT SUPPORTED !!!! "Unrealizable" condition not available by OSM */ - /* o15.0.1.8: */ /* - Request join with irrelevant RATE : get a ERR_INSUFFICIANT_COMPONENTS */ osm_log( &p_osmt->log, OSM_LOG_INFO, @@ -968,6 +987,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1006,6 +1026,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1044,6 +1065,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1079,6 +1101,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1120,6 +1143,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1155,16 +1179,19 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } +#if 0 + /* Currently PacketLifeTime isn't checked in opensm */ /* Check 
PacketLifeTime as 0 */ osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_run_mcast_flow: " "Checking Create with unrealistic packet life value less than 0 (o15.0.1.8)...\n" ); - /* impossible requested rate */ + /* impossible requested packet life */ mc_req_rec.pkt_life = 0 | IB_PATH_SELECTOR_LESS_THAN << 6; comp_mask = @@ -1191,6 +1218,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } #endif @@ -1247,6 +1275,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1273,12 +1302,10 @@ osmt_run_mcast_flow( IN osmtest_t * cons "Number of MC Records found in SA DB is %d.\n", middle_cnt); if (middle_cnt != start_cnt) { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_mcast_flow: ERR 02ah: " + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_run_mcast_flow: " "Got different number of records stored in SA DB (before any creation)\n" "Instead of %d got %d\n", start_cnt,middle_cnt); - status=IB_ERROR; - goto Exit; } } @@ -1332,6 +1359,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1389,6 +1417,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons ib_get_err_str( status ), ib_get_mad_status_str( (ib_mad_t*)(&res_sa_mad) ) ); + status = IB_ERROR; goto Exit; } @@ -1433,6 +1462,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + 
cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ @@ -1466,7 +1506,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } - + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ @@ -1500,6 +1550,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ @@ -1532,6 +1593,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( 
p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ /* Using Exact feasible MTU & RATE */ @@ -1568,6 +1640,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ /* Using Exact feasible RATE */ @@ -1600,7 +1683,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } - + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Good Flow - mgid is 0 while giving all required fields for join : P_Key, Q_Key, SL, FlowLabel, Tclass */ /* Using Exact feasible MTU */ @@ -1633,6 +1726,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + 
"osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* o15.0.1.5: */ /* - Check the returned MGID is valid. (p 804) */ @@ -1702,6 +1806,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* o15.0.1.6: */ /* - Create a new MCG with valid requested MGID. 
*/ @@ -1730,6 +1845,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons cl_ntoh64(good_mgid.unicast.interface_id)); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_run_mcast_flow: " @@ -1883,6 +2009,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons cl_ntoh64(good_mgid.unicast.interface_id)); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* Change the flags to invalid value 0x2 - get back INVALID REQ */ @@ -1930,14 +2067,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons "Checking link local MGID 0xFF02:0:0:0:0:0:0:1 (o15.0.1.6)...\n" ); - mc_req_rec.mgid = (ib_gid_t) { - { - 0xFF, 0x02, 0x00, 0x00, - 0x00, 0x00, 0x00, 0x00, - 0x00, 0x00, 0x00, 0x00, - 0x00, 0x00, 0x00, 0x01 - }, - }; + mc_req_rec.mgid = osm_link_local_mgid; status = osmt_send_mcast_request( p_osmt, 1, &mc_req_rec, @@ -1953,8 +2083,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } - - + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, 
OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* o15.0.1.7 - implicitlly checked during the prev steps. */ /* o15.0.1.8 - implicitlly checked during the prev steps. */ @@ -1998,7 +2137,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons /* Lets try a valid join scope state */ osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_run_mcast_flow: " - "Checking new MGID with invalid join state (o15.0.1.9)...\n" + "Checking new MGID with valid join state (o15.0.1.9)...\n" ); mc_req_rec.mgid = good_mgid; @@ -2082,6 +2221,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons ); goto Exit; } + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmt_run_mcast_flow: " + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); /* o15.0.1.10 - can't check on a single client .-- obsolete - checked by SilverStorm bug o15-0.2.4, never the less recheck */ @@ -2984,9 +3134,17 @@ osmt_run_mcast_flow( IN osmtest_t * cons else { cur_mlid = cl_ntoh16(p_mc_res->mlid); + /* Save the mlid created in test_created_mlids map */ + p_recvd_rec = (ib_member_rec_t*)ib_sa_mad_get_payload_ptr( &res_sa_mad ); osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmt_run_mcast_flow : " - "Added to new MCGroup with Mlid 0x04%x\n",cur_mlid); + "Created MGID:0x%016" PRIx64 " : " + "0x%016" PRIx64 " MLID:0x%04X\n", + cl_ntoh64( p_recvd_rec->mgid.unicast.prefix ), + cl_ntoh64( 
p_recvd_rec->mgid.unicast.interface_id ), + cl_ntoh16( p_recvd_rec->mlid )); + cl_map_insert(&test_created_mlids, + cl_ntoh16(p_recvd_rec->mlid), p_recvd_rec ); } tmp_mlid--; } @@ -3070,7 +3228,10 @@ osmt_run_mcast_flow( IN osmtest_t * cons "GetTable of all records has failed!\n"); goto Exit; } - /* Only when we are on single mode check flow - do the count comparison, otherwise skip */ + + /* If we are in single mode check flow - need to make sure all the multicast groups + that are left are not ones created during the flow. + */ if (p_osmt->opt.mmode == 1 || p_osmt->opt.mmode == 3) { end_cnt = cl_qmap_count(&p_osmt->exp_subn.mgrp_mlid_tbl); @@ -3080,14 +3241,47 @@ osmt_run_mcast_flow( IN osmtest_t * cons /* when we comapre num of MCG we should consider an outside source which create other MCGs */ if ((end_cnt-fail_to_delete_mcg) != (start_cnt - mcg_outside_test_cnt)) { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmt_run_mcast_flow: ERR 02FG: " + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_run_mcast_flow: " "Got different number of records stored in SA DB\n\t\t" "at Start got %d, at End got %d (IPoIB groups only)\n", (start_cnt-mcg_outside_test_cnt),(end_cnt-fail_to_delete_mcg)); + } + p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; + p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); + while( p_mgrp != (osmtest_mgrp_t*)cl_qmap_end( p_mgrp_mlid_tbl ) ) + { + uint16_t mlid = (uint16_t)cl_qmap_key((cl_map_item_t*)p_mgrp); + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_run_mcast_flow: " + "Found MLID:0x%04X\n", + mlid); + /* Check if the mlid is in the test_created_mlids. If TRUE, then we + didn't delete a MCgroup that was created in this flow. */ + if ( cl_map_get (&test_created_mlids, mlid) != NULL ) + { + /* This means that we still have an mgrp that we created!! 
*/ + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmt_run_mcast_flow: ERR 02FG: " + "Wasn't able to erase mgrp with MGID:0x%016" PRIx64 " : 0x%016" + PRIx64 " MLID:0x%04X\n", + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix), + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id), + mlid ); status=IB_ERROR; goto Exit; } + else + { + osm_log( &p_osmt->log, OSM_LOG_INFO, + "osmt_run_mcast_flow: " + "Still exists MGID:0x%016" PRIx64 " : 0x%016" + PRIx64 "\n", + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix), + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id) ); + } + p_mgrp = (osmtest_mgrp_t*)cl_qmap_next( &p_mgrp->map_item ); + } } Exit: OSM_LOG_EXIT( &p_osmt->log ); Index: opensm/osm_helper.c =================================================================== --- opensm/osm_helper.c (revision 5457) +++ opensm/osm_helper.c (working copy) @@ -970,11 +970,14 @@ osm_dump_mc_record( "0x%016" PRIx64 "\n" "\t\t\t\tqkey....................0x%X\n" "\t\t\t\tMlid....................0x%X\n" - "\t\t\t\tScopeState..............0x%X\n" - "\t\t\t\tRate....................0x%X\n" "\t\t\t\tMtu.....................0x%X\n" "\t\t\t\tTClass..................0x%X\n" + "\t\t\t\tpkey....................0x%X\n" + "\t\t\t\tRate....................0x%X\n" + "\t\t\t\tPacketLife..............0x%X\n" "\t\t\t\tSLFlowLabelHopLimit.....0x%X\n" + "\t\t\t\tScopeState..............0x%X\n" + "\t\t\t\tProxyJoin...............0x%X\n" "", cl_ntoh64( p_mcmr->mgid.unicast.prefix ), cl_ntoh64( p_mcmr->mgid.unicast.interface_id ), @@ -982,11 +985,14 @@ osm_dump_mc_record( cl_ntoh64( p_mcmr->port_gid.unicast.interface_id ), cl_ntoh32( p_mcmr->qkey ), cl_ntoh16( p_mcmr->mlid ), - p_mcmr->scope_state, - p_mcmr->rate, p_mcmr->mtu, p_mcmr->tclass, - cl_ntoh32( p_mcmr->sl_flow_hop ) + cl_ntoh16( p_mcmr->pkey ), + p_mcmr->rate, + p_mcmr->pkt_life, + cl_ntoh32( p_mcmr->sl_flow_hop ), + p_mcmr->scope_state, + p_mcmr->proxy_join ); } } From jackm at mellanox.co.il Thu Feb 23 00:35:00 2006 From: jackm 
at mellanox.co.il (Jack Morgenstein)
Date: Thu, 23 Feb 2006 10:35:00 +0200
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID
In-Reply-To:
References:
Message-ID: <200602231035.00492.jackm@mellanox.co.il>

On Thursday 23 February 2006 09:41, Sean Hefty wrote:
> >When RECEIVING:
> >    If RESPONSE bit is set:
> >        Need to check TID/class against outstanding requests.
> >    Otherwise:
> >        Need to check TID/GID/class against outstanding
> >        responses (RMPP)
> >        GID is important here, because responder may have several
> >        RMPP sessions active with same TID, but involving different
> >        Destination hosts.
>
> What specific error do you see in the receive path? Responses should match
> with requests based on TID alone, since we control setting the TID. I can
> see where a duplicate request may be received for a response that is
> currently in transfer, but that seems like a narrow window.
>

The error is in the "Otherwise" portion. It is possible for the SA to receive
Get-Table requests from different hosts (for different attributes, even) where
the TID is identical between hosts. Currently, in RMPP, the TID alone is used
to match (on the responder side) the RMPP session with the received ACKs/NACKs
(which do not have the response bit set). In such a case, all the received
ACKs/NACKs, from all hosts which are using that TID in simultaneous sessions,
would be delivered to the earliest RMPP response session.

I'm not sure how rare such a simultaneous-session occurrence would be in a
large (thousands of nodes) network upon, say, network initialization.

We are controlling TIDs within a single host, but we are not guaranteeing
network-wide uniqueness for TIDs, nor should we, since the IB Spec states
(section 13.4.6.4) that the TID/GID/class combination should be unique, not
TID alone.
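To make the matching rule above concrete, here is a minimal user-space sketch in C. It is not the ib_mad code: `struct rmpp_session`, `find_by_tid` and `find_by_tid_gid` are hypothetical names invented for this illustration, and a flat table of active sessions is assumed. It shows why a TID-only lookup can pick the wrong RMPP session when two hosts happen to use the same TID, while a TID/GID/class lookup does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical session record, for illustration only (not an ib_mad type). */
struct rmpp_session {
    uint64_t tid;          /* transaction ID chosen by the requester */
    uint8_t  mgmt_class;   /* management class of the MAD */
    uint8_t  peer_gid[16]; /* GID of the remote requester */
};

/* TID/class lookup: returns the earliest session with this TID and class.
 * Adequate for responses to our own requests, because we pick those TIDs. */
struct rmpp_session *find_by_tid(struct rmpp_session *tbl, size_t n,
                                 uint64_t tid, uint8_t mgmt_class)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].tid == tid && tbl[i].mgmt_class == mgmt_class)
            return &tbl[i];
    return NULL;
}

/* TID/GID/class lookup: needed on the responder side, where different
 * remote hosts may legitimately use the same TID at the same time. */
struct rmpp_session *find_by_tid_gid(struct rmpp_session *tbl, size_t n,
                                     uint64_t tid, const uint8_t gid[16],
                                     uint8_t mgmt_class)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].tid == tid &&
            tbl[i].mgmt_class == mgmt_class &&
            memcmp(tbl[i].peer_gid, gid, 16) == 0)
            return &tbl[i];
    return NULL;
}
```

With two sessions that share a TID but belong to different hosts, `find_by_tid` always returns the first one, so an ACK from the second host would be handed to the wrong session; `find_by_tid_gid` separates them.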
- Jack

From ogerlitz at voltaire.com Thu Feb 23 00:40:16 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 10:40:16 +0200
Subject: [openib-general] [PATCH 1/6] [RFC] iscsi_iser header file
In-Reply-To: <20060222161901.GA24303@lst.de>
References: <20060222161901.GA24303@lst.de>
Message-ID: <43FD74F0.6010905@voltaire.com>

Christoph Hellwig wrote:
> iser_adaptor is misspelled ;-) seriously, I think iser_device might be
> a better name for this.

OK

> Please don't use volatile but an atomic_t or bitops for the session state field.

OK

From ogerlitz at voltaire.com Thu Feb 23 01:23:10 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 11:23:10 +0200
Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator
In-Reply-To: <20060222162507.GB24303@lst.de>
References: <20060222162507.GB24303@lst.de>
Message-ID: <43FD7EFE.7050106@voltaire.com>

Christoph Hellwig wrote:
>> +#define ISER_HDR_LEN sizeof (struct iser_hdr)
>> +#define ISER_PDU_BHS_LENGTH sizeof (struct iscsi_hdr)
> these two macros are just use in ISER_TOTAL_HEADERS_LEN below,
> just kill them.

OK

>> +#define USE_OFFSET(offset) (offset)
>> +#define USE_NO_OFFSET 0
>> +#define USE_SIZE(size) (size)
>> +#define USE_ENTIRE_SIZE 0
> please kill these macros.

OK

> Please kill the p_ prefix for pointer types all over the code.

OK

>> +static int iser_post_receive_control(struct iscsi_iser_conn *p_iser_conn)
...
>> +	rx_desc = kmem_cache_alloc(ig.desc_cache,
>> +				   GFP_KERNEL | __GFP_NOFAIL);
> __GFP_NOFAIL doesn't work for slab (kmem_cache_alloc/kmalloc/kzalloc/kcalloc)
> allocations

I see. The code has runtime memory allocations for the following cases:

+1 rx descriptor
+2 rx control buffer
+3 tx unsolicited dataout descriptor
+4 data buffer for the copy of an SG which is unaligned for RDMA
+5 page vec to be used for the FMR mapping on an SG

The original thought was that with GFP_NOFAIL we can simplify the code
(e.g. fewer error paths). Anyway, rethinking it:

Types +1 and +2 can be done at conn setup time with GFP_KERNEL without
much pain.

As for +3 (unsolicited dataouts), iscsi_tcp solved it by using a mempool. I
have a few ideas how to solve it here, so one way or another +3 will also be
modified not to assume a _NOFAIL type.

+5 is easy to remove; command processing is serialized anyway, so the
connection can hold a page vec of the maximum size (128) to be used for all
our FMR mappings.

As for type +4, I am looking for the simplest solution since it is a very
rare case. But again, OK, we will not assume NOFAIL here either.

>> +send_data_out_error:
>> +	if (p_send_dto != NULL)
>> +		iser_dto_buffs_release(p_send_dto);
>> +	if (tx_desc != NULL)
>> +		kmem_cache_free(ig.desc_cache, tx_desc);
> could you please do the same goto-unwinding style we use elsewhere
> in the kernel? That is one label before each unwind step and jump
> directly to that instead of adding tons of conditionals in the error path.
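The goto-unwinding style requested above can be sketched in plain user-space C. This is only an illustration of the pattern, not the iSER code: `res_alloc`, `res_free`, `setup_three` and the `live_allocs` counter are hypothetical helpers made up for the example, and the `budget` argument exists solely to inject an allocation failure at a chosen step.

```c
#include <stdlib.h>

/* Number of currently held resources; lets a caller check for leaks. */
int live_allocs;

/* Hypothetical allocator: succeeds only while *budget is positive. */
void *res_alloc(int *budget)
{
    void *p;

    if (*budget <= 0)
        return NULL;
    p = malloc(16);
    if (p) {
        (*budget)--;
        live_allocs++;
    }
    return p;
}

void res_free(void *p)
{
    free(p);
    live_allocs--;
}

/* Acquire three resources in order. On failure, jump to the label that
 * undoes exactly the steps already completed; no NULL checks are needed
 * in the error path. Returns 0 on success, -1 on failure. */
int setup_three(int budget)
{
    void *desc, *buf, *vec;

    desc = res_alloc(&budget);
    if (!desc)
        goto err_out;

    buf = res_alloc(&budget);
    if (!buf)
        goto err_free_desc;

    vec = res_alloc(&budget);
    if (!vec)
        goto err_free_buf;

    /* Success: real code would keep the resources; they are released
     * here only so the example does not leak. */
    res_free(vec);
    res_free(buf);
    res_free(desc);
    return 0;

err_free_buf:
    res_free(buf);
err_free_desc:
    res_free(desc);
err_out:
    return -1;
}
```

Each label sits immediately before the cleanup for the step preceding the failure, so the labels are listed in the reverse order of acquisition and every failure point falls through only the cleanups it actually needs.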
OK

From ogerlitz at voltaire.com Thu Feb 23 01:41:39 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 11:41:39 +0200
Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator
In-Reply-To: <20060222162507.GB24303@lst.de>
References: <20060222162507.GB24303@lst.de>
Message-ID: <43FD8353.3020909@voltaire.com>

Christoph Hellwig wrote:
>> +static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task,
>> +				  struct iser_data_buf *p_data,
...
>> +	if (p_data->type == ISER_BUF_TYPE_SINGLE) {
>> +		p_iser_task->data_len[iser_dir] = p_data->size;
>> +		dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size,

> I'd say kill the non-SG case. We're in the progress of removing non-SG
> commands in the scsi midlayer, and I'm pretty sure they won't exist
> anymore before the iser code merged.

Indeed, I see that drivers/scsi/iscsi_tcp.c of 2.6.16-rc1 (and on) does not
support non-SG SCSI commands. Can you confirm that as of 2.6.16 a SCSI LLD
does not need to support the non-SG case?

OK - the code for upstream will assume only SG commands.

Now, regardless of when iSER is to be merged, it is used with 2.6.15 and
older kernels that do issue non-SG commands.

I wonder what would be the simplest patch to support it. Does it make sense
to use virt_to_page on sc->request_buffer to compose a one-entry SG on the
fly and use it down the code?

Or.
From ogerlitz at voltaire.com Thu Feb 23 01:52:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 11:52:05 +0200
Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA
In-Reply-To: <20060222162903.GC24303@lst.de>
References: <20060222162903.GC24303@lst.de>
Message-ID: <43FD85C5.1030502@voltaire.com>

Christoph Hellwig wrote:
>> +	if (cmd_dir == ISER_DIR_OUT) {
>> +		/* copy the unaligned sg the buffer which is used for RDMA */
>> +		struct scatterlist *p_sg = (struct scatterlist *)p_mem->p_buf;
>> +		int i;
>> +		char *p;
>> +
>> +		for (p = mem, i = 0; i < p_mem->size; i++) {
>> +			memcpy(p,
>> +			       page_address(p_sg[i].page) + p_sg[i].offset,
>> +			       p_sg[i].length);
>> +			p += p_sg[i].length;

> pages you get sent down in a sg list don't have to be kernel mapped,
> you need to use kmap or kmap_atomic to access them.

OK

Can you educate me here a little? Basically, what I was thinking about DMA
mapping is that it maps from a kernel virtual address to the bus address
related to the device, and that an SG sent down to an LLD from the midlayer
can be supplied to dma_map_sg.

Since that was my thought, I assumed using page_address(sg->page) is fine.

So what you say here is that there are cases (e.g. highmem) where dma_map_sg
does not assume such a mapping currently exists, and the LLD cannot assume
this either?

Or.

From yipeeyipeeyipeeyipee at yahoo.com Thu Feb 23 01:54:13 2006
From: yipeeyipeeyipeeyipee at yahoo.com (yipee)
Date: Thu, 23 Feb 2006 09:54:13 +0000 (UTC)
Subject: [openib-general] OpenSM mads timeout
Message-ID:

Hi,

What happens when OpenSM stops receiving replies to MAD requests it sends to
hosts in the subnet? Does it notify the InfiniBand switch to do something to
its corresponding ports? Which function(s) in the OpenSM source do this?
Thanks,
y

From ogerlitz at voltaire.com Thu Feb 23 02:27:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 12:27:05 +0200
Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator
In-Reply-To: <43FD8353.3020909@voltaire.com>
References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com>
Message-ID: <43FD8DF9.5090200@voltaire.com>

>> I'd say kill the non-SG case. We're in the progress of removing non-SG
>> commands in the scsi midlayer, and I'm pretty sure they won't exist
>> anymore before the iser code merged.

>I wonder what would be the simplest patch to support it, does it make sense to
>use virt_to_page on sc->request_buffer to compose one entry SG on the fly
>and use it down the code?

Specifically, does something like this make sense?

	struct scatterlist my_sg; /* somewhere, but not on the iser_send_command stack */

	if (!sc->use_sg) {
		my_sg.page   = virt_to_page(sc->request_buffer);
		my_sg.length = sc->request_bufflen;
		my_sg.offset = 0;
	}

Now continue as usual to process my_sg (it can't be on the stack).

Or.
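One detail of the one-entry SG idea above may be worth a sketch: a zero offset is only right when the request buffer happens to start on a page boundary. The user-space C model below is an assumption-laden illustration, not kernel code. `struct demo_sg`, `demo_sg_from_buf` and `DEMO_PAGE_SIZE` are hypothetical stand-ins for `struct scatterlist`, the proposed conversion, and `PAGE_SIZE`; in the kernel the offset would presumably come from `offset_in_page()`.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u /* assumed page size for this illustration */

/* Simplified stand-in for struct scatterlist (hypothetical type). */
struct demo_sg {
    uintptr_t page_base; /* start address of the page holding the buffer */
    size_t    offset;    /* offset of the buffer inside that page */
    size_t    length;    /* length of the buffer in bytes */
};

/* Describe a flat buffer at address addr as a single SG-like entry.
 * The page is found by masking off the low bits of the address and
 * the offset is whatever those low bits were. */
struct demo_sg demo_sg_from_buf(uintptr_t addr, size_t len)
{
    struct demo_sg sg;

    sg.page_base = addr & ~(uintptr_t)(DEMO_PAGE_SIZE - 1);
    sg.offset    = (size_t)(addr & (DEMO_PAGE_SIZE - 1));
    sg.length    = len;
    return sg;
}
```

For a buffer at address 0x12345 the entry points at page 0x12000 with offset 0x345; only a page-aligned address such as 0x20000 yields offset 0. Whether a single entry is sufficient when the buffer spans several pages depends on how the buffer was allocated, which this sketch does not model.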
References: <6AB138A2AB8C8E4A98B9C0C3D52670E301075D8C@mtlexch01.mtl.com>
Message-ID: <43FD9D2C.2070308@voltaire.com>

Gil Bloch wrote:
> I believe we should add support for a resize WQ command (as a part of
> modify QP) to enable changing the WQ size.

> On a very large scale cluster, with many operating QPs, the work queue
> memory consumption might be expansive. Thus the MPI implementation
> should tradeoff for pipelining requests vs. WQ memory consumption. The
> resize WQ will allow on-demand adaptive WQ setting instead of static
> allocation of the memory resource, which I believe can increase
> performance and save memory at the same time.

As others pointed out, I think such changes should be driven by the
combination of
a) a ---real--- need of an app
b) support from as many HW vendors as possible

Did any IB MPI group approach this list expressing a need for this feature?

Moreover, here are some more points that might suggest this feature is not
needed:

- why does the size of the cluster relate to how many credits rank A in an
  MPI job would have on the connection (QP) with rank B?

- who said that (say) eight credits per connection are not excellent for an
  IB MPI?

- some MPIs (e.g. Open MPI) open connections on demand, so only if rank A
  attempts to send something to rank B does A connect to B

- what you really might want to resize is an SRQ, in case the implementations
  want to open connections on demand and add more and more QPs to the SRQ,
  and the work on the modify SRQ verb was initiated by Mellanox?

- same for resize CQ: the verb exists and so does the need for it.

Or.
From ogerlitz at voltaire.com Thu Feb 23 05:29:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 15:29:05 +0200
Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator
In-Reply-To: <20060222162507.GB24303@lst.de>
References: <20060222162507.GB24303@lst.de>
Message-ID: <43FDB8A1.5010004@voltaire.com>

Christoph Hellwig wrote:
>> +/* Constant PDU lengths calculations */
>> +#define ISER_HDR_LEN sizeof (struct iser_hdr)
>> +#define ISER_PDU_BHS_LENGTH sizeof (struct iscsi_hdr)
>
> these two macros are just use in ISER_TOTAL_HEADERS_LEN below,
> just kill them.
>
>> +#define USE_OFFSET(offset) (offset)
>> +#define USE_NO_OFFSET 0
>> +#define USE_SIZE(size) (size)
>> +#define USE_ENTIRE_SIZE 0
>
> please kill these macros.
both done

------------------------------------------------------------------------
r5468 | ogerlitz | 2006-02-23 15:34:16 +0200 (Thu, 23 Feb 2006) | 5

replaced struct iser_adaptor name to be iser_device, changed adaptor to
device notation everywhere

Signed-off-by: Or Gerlitz
------------------------------------------------------------------------
r5466 | ogerlitz | 2006-02-23 14:38:09 +0200 (Thu, 23 Feb 2006) | 5

eliminated the following defines: ISER_HDR_LEN ISER_PDU_BHS_LENGTH
USE_OFFSET USE_NO_OFFSET USE_SIZE USE_ENTIRE_SIZE

Signed-off-by: Or Gerlitz

From ogerlitz at voltaire.com Thu Feb 23 05:38:08 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 23 Feb 2006 15:38:08 +0200
Subject: [openib-general] [PATCH 6/6] [RFC] iser socket
In-Reply-To: <20060222163049.GD24303@lst.de>
References: <20060222163049.GD24303@lst.de>
Message-ID: <43FDBAC0.1090402@voltaire.com>

Christoph Hellwig wrote:
>> + note that data is never moved on the socket via send/recv but
>>   only by calls from iscsi_iser.c to iser_send_control/command/dataout
>> + data originting/resuling in user space (eg login request/respose)
>>   is moved down/up by open iscsi using netlink

> So what do the iser sockets do? They look like noop stubs to me.

The iser socket's sole real action is connect; see iser_sock_connect.

Open-iscsi expects the transport to support the following chain:

+1 create a socket
+2 connect the socket to the target
+3 bind the socket to an iSCSI connection, enabling actual communication
   over this iSCSI connection

So we implemented the minimal set of operations (i.e. socket/connect) which
enables open-iscsi to work transparently with the iser transport.

Or.
From monil at voltaire.com Thu Feb 23 05:51:16 2006
From: monil at voltaire.com (Moni Levy)
Date: Thu, 23 Feb 2006 15:51:16 +0200
Subject: [openib-general] Towards a 1.0 release of OpenIB
In-Reply-To: <1140572679.6603.28.camel@camp4.serpentine.com>
References: <1140572679.6603.28.camel@camp4.serpentine.com>
Message-ID: <6a122cc00602230551y77a0829erc0ee53c937872601@mail.gmail.com>

On 2/22/06, Bryan O'Sullivan wrote:
>     * We would like everyone to be able to run the same tests, so
>       someone must gather test suites and execution instructions
>       together.

How would you like to manage that list of tests? Wiki?

> Within the next week, I'd like to gain an understanding of the following
> things:
>
>     * Which features users want to see tested

Again, do you expect that the tests will be listed in email, or would you
prefer to start some kind of document?

>     * Which distros users want binary packages for

I guess that the latest SLES 10 beta and EL4 will be OK to start with, in my
opinion.

>     * Who can sign up to build and test those packages

I hope that the distro teams will be happy to do so together with the vendor
companies.

>     * Whether we need to be building binary kernel packages to make
>       testing more consistent

That might also be a good idea for simplifying the test setup bring-up
process.
I think that we at least need to agree on a reference .config for the latest
kernel to use for common ground.

Moni Levy | +972-971-7670(o)
Project Manager, Mainstream IB host stack
Voltaire – The Grid Backbone
http://www.voltaire.com/

From alexn at voltaire.com Thu Feb 23 05:58:27 2006
From: alexn at voltaire.com (Alexander Nezhinsky)
Date: Thu, 23 Feb 2006 15:58:27 +0200
Subject: [openib-general] [PATCH 6/6] [RFC] iser socket
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1297203@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1297203@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <43FDBF83.5010504@voltaire.com>

Caitlin Bestler wrote:
>>So what do the iser sockets do? They look like noop stubs to me.
>>
>
>Good question.
>
>I am guessing that they are exactly noop stubs, and the real
>point is to have a socket associated with an iSER RDMA connection.
>
>My question is why? Unless the attempt is to allow upgrading
>an iSCSI stream connection to an iSER connection I don't see
>why an iSER RDMA connection is in any more of a need for having
>a proxy socket than any other RDMA connection.
>

We don't upgrade an iSCSI stream connection but start with an RDMA connection
right away. The iSER code is going to be one of the open-iscsi transports,
and open-iscsi opens connections using sockets from user space, which is only
natural with TCP. The iSER RC connection should be opened from the kernel, so
this special socket gives us an opportunity to do so, while leaving intact
the entire mechanism of connection establishment and user-kernel handover.
We don't really need to implement read/write primitives because they are initiated either from within kernel transport module itself or through a special user-kernel interface bypassing the socket. >I really don't see the benefit of having a "socket" that is >not truly integrated with the host stack. What socket attributes >are being sought? And how is it unique to iSER as opposed to >RDMA in general? > Perhaps it is plausible to implement a general-purpose stack-integrated socket giving access to IB RC connections, if this is what you mean. But this was clearly out of scope for the iSER initiator. We sought a solution to the immediate problem in open-iscsi. So the main socket feature used here was a neat way to delegate the IB connection establishment from user space to kernel. From galiz at voltaire.com Thu Feb 23 06:58:56 2006 From: galiz at voltaire.com (Gali Zisman) Date: Thu, 23 Feb 2006 16:58:56 +0200 Subject: [openib-general] Towards a 1.0 release of OpenIB Message-ID: Hi Bryan, I am a little concerned about the release timeline. It looks like the GA date is May 08. If I remember correctly the SLES 10 release date is before that. And it looks a little late for the RH release plan as well. We need to make sure that the distributions could use this release. Of course I can not comment on their behalf but it is crucial that we get their approval to the schedule. Thanks, Gali -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Bryan O'Sullivan Sent: Wednesday, February 22, 2006 3:45 AM To: openib-general Subject: [openib-general] Towards a 1.0 release of OpenIB Here's a strawman proposal for a 1.0 release process. Please let me know what you think. I have a set of absolutely minimal goals for the 1.0 release, and I would like to open up a short period of wider discussion about those goals. Expectation management: * The process is open and transparent.
Discussion happens on openib-general. Bugs go into Bugzilla. Documentation lives in the wiki. Changes are made in Subversion. There should be no way someone can step up after the fact and say "but I wasn't informed of the plan!" * The target user population is reasonably savvy early adopters. * For everything that we commit to shipping, we must be able to tell users what has been tested, how heavily, and on what hardware. Testing: * We need to know what tests people can run, and in what environments. * We would like everyone to be able to run the same tests, so someone must gather test suites and execution instructions together. Methods of delivery: * A branch of the Subversion repository. * A set of source tarballs. * A collection of binary packages. We need to identify distros that people are interested in, and distros that people have time and resources to build for. Milestone timeline: * Feb 24 - create 1.0 release branch in Subversion repository * Feb 28 - close of "what I want in the 1.0 release" discussion * Feb 28 - Bugzilla configured properly * Mar 03 - wiki contains actual data about test suites, who's running what, status, etc. 
* Mar 06 - rc1 snapshot and source tarballs available * Mar 27 - rc2 * Apr 17 - rc3 * May 08 - 1.0 Within the next week, I'd like to gain an understanding of the following things: * Which features users want to see tested * Who can sign up to test and maintain those features, and how * Which distros users want binary packages for * Who can sign up to build and test those packages * Whether we need to be building binary kernel packages to make testing more consistent * Which patches or features need to be pushed to the upstream kernel (I'd prefer an unpatched kernel.org 2.6.17 to just work with the 1.0 userspace, for example) _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Thu Feb 23 08:14:26 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 23 Feb 2006 18:14:26 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: References: Message-ID: <200602231814.26918.jackm@mellanox.co.il> On Thursday 23 February 2006 09:41, Sean Hefty wrote: > > What specific error do you see in the receive path? SA Host, Host1, Host2. Host1 and Host2 have simultaneous GET_TABLE query responses (both with same TID) in flight with the SA Host. Host1 sends an RMPP abort to the SA.
The SA Host receives the abort and does abort_send(), searching on the TID alone. The wrong session gets aborted. - Jack > > - Sean From bos at pathscale.com Thu Feb 23 08:20:34 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 23 Feb 2006 08:20:34 -0800 Subject: [openib-general] Towards a 1.0 release of OpenIB In-Reply-To: <6a122cc00602230551y77a0829erc0ee53c937872601@mail.gmail.com> References: <1140572679.6603.28.camel@camp4.serpentine.com> <6a122cc00602230551y77a0829erc0ee53c937872601@mail.gmail.com> Message-ID: <1140711634.17258.45.camel@serpentine.pathscale.com> On Thu, 2006-02-23 at 15:51 +0200, Moni Levy wrote: > On 2/22/06, Bryan O'Sullivan wrote: > > * We would like everyone to be able to run the same tests, so > > someone must gather test suites and execution instructions > > together. > > How would you like to manage that list of tests ? Wiki ? I think that would be a good idea. > > Within the next week, I'd like to gain an understanding of the following > > things: > > > > * Which features users want to see tested > > again, do you expect that the tests will be listed in email or you > prefer to start some kind of a document ? We should certainly let people know of updates to the wiki via email, since I hate having to check a web site every few days to see if anything has changed. > > * Which distros users want binary packages for > > I guess that SLES 10 latest beta & EL4 in my opinion will be ok to start with. That's useful to know, thanks. A possible problem with doing RHEL4 builds is that RHEL4 is tied to a 2.6.9 kernel, so someone will have to maintain the driver backport stuff. > > * Who can sign up to build and test those packages > > I hope that the distros teams will be happy to do so together with the > vendor companies. Doug Ledford has said before that he'd build packages for EL4. SUSE has been very quiet so far. > That might be a good idea for also simplifying the test setups bring up process. 
> I think that we at least need to agree on a reference .config for the > latest kernel to use for common ground. For RHEL4, I'd suggest using Red Hat's .config, and just adding the necessary IB bits. This is what I do when building e.g. 2.6.15 kernels for my own internal testing, too; I start out with the FC4 kernel .config, and modify it. References: <200602231814.26918.jackm@mellanox.co.il> Message-ID: <200602231825.19333.jackm@mellanox.co.il> On Thursday 23 February 2006 18:14, Jack Morgenstein wrote: > On Thursday 23 February 2006 09:41, Sean Hefty wrote: > > What specific error do you see in the receive path? > > SA Host, Host1, Host2. > > Host1 and Host2 have simultaneous GET_TABLE query responses (both with same > TID) in flight with the SA Host. > > Host1 sends an RMPP abort to the SA. The SA Host receives the abort and > does abort_send(), searching on the TID alone. The wrong session gets > aborted. > > - Jack > Regarding RMPP abort processing, I see that there is a problem: the code assumes that all aborts are received by the responder:

static void process_rmpp_abort(struct ib_mad_agent_private *agent,
			       struct ib_mad_recv_wc *mad_recv_wc)
{
	struct ib_rmpp_mad *rmpp_mad;

	rmpp_mad = (struct ib_rmpp_mad *)mad_recv_wc->recv_buf.mad;

	if (rmpp_mad->rmpp_hdr.rmpp_status < IB_MGMT_RMPP_STATUS_ABORT_MIN ||
	    rmpp_mad->rmpp_hdr.rmpp_status > IB_MGMT_RMPP_STATUS_ABORT_MAX) {
		abort_send(agent, rmpp_mad->mad_hdr.tid,
			   IB_MGMT_RMPP_STATUS_BAD_STATUS);
		nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS);
	} else
>>>>>>>>>> This is performed if the abort status is a valid one
		abort_send(agent, rmpp_mad->mad_hdr.tid,
			   rmpp_mad->rmpp_hdr.rmpp_status);
}

However, there are abort messages which the responder may send to the requester as well (RMPP status codes 122, 123 for example, which can only be sent by the responder -- see SPEC page 774). These aborts should result in the RMPP receive session being terminated, and have no connection with an RMPP send session.
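The TID-only lookup problem described earlier in this thread can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `struct rmpp_session`, `find_by_tid()` and `find_by_tid_and_peer()` are hypothetical names, not ib_mad code, and a LID is used as the peer identifier only to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical session record; not a structure from ib_mad. */
struct rmpp_session {
	uint64_t tid;       /* transaction ID chosen by the requester */
	uint16_t peer_lid;  /* assumed peer identifier (a LID here) */
	bool     active;
};

/* Lookup by TID alone: returns the earliest active session with that TID. */
static struct rmpp_session *find_by_tid(struct rmpp_session *tbl, size_t n,
					uint64_t tid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].active && tbl[i].tid == tid)
			return &tbl[i];
	return NULL;
}

/* Lookup by (TID, peer): only a session with that peer can match. */
static struct rmpp_session *find_by_tid_and_peer(struct rmpp_session *tbl,
						 size_t n, uint64_t tid,
						 uint16_t peer_lid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].active && tbl[i].tid == tid &&
		    tbl[i].peer_lid == peer_lid)
			return &tbl[i];
	return NULL;
}
```

With two active sessions that share a TID but belong to different peers, `find_by_tid()` always returns the first one, so an abort from the second peer would hit the wrong session; `find_by_tid_and_peer()` returns the session for the peer that sent the abort. The sketch only covers the choice of lookup key. It does not address aborts sent by the responder, which concern a receive session rather than a send session.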
I'm thinking about a fix. - Jack From bos at pathscale.com Thu Feb 23 08:24:44 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 23 Feb 2006 08:24:44 -0800 Subject: [openib-general] Towards a 1.0 release of OpenIB In-Reply-To: References: Message-ID: <1140711884.17258.50.camel@serpentine.pathscale.com> On Thu, 2006-02-23 at 16:58 +0200, Gali Zisman wrote: > I am a little concerned about the release timeline. > It looks like the GA date is May 08. If I remember correctly the SLES 10 > release date is before that. And it looks a little late for the RH > release plan as well. I understand this concern. There's a natural tension between wanting the distro vendors to be able to package and ship something and having that something be reasonably functional and somewhat tested. > We need to make sure that the distributions could use this release. Doug, do you have any comments in this regard? > Of course I can not comment on their behalf but it is curtail that we > get their approval to the schedule. I don't know who's even paying attention to any of this at Novell/SUSE, I'm afraid. I'd certainly welcome their input. Hello! I'd like to get comments on an issue that seems important to me with respect to openib release 1.0. It seems that the openib release 1.0 as planned will include not only userspace libraries but also some kernel level modules. While this might be a good idea for modules such as iSER which are not currently part of the mainline kernel tree, it is in my opinion clearly not a good idea to replace the modules which *are* distributed with the mainline kernel. I am talking about components such as IPoIB or core verbs: it seems clear that we must not create another fork of these (in addition to distributions and to mainline kernel releases) by simply ripping the mainline kernel code out and replacing with our own. 
This concern was recently raised on lkml - for modules included in mainline people are expected to test mainline release candidates and not bypass it by distributing out of kernel modules. I guess it should be OK, if necessary, to keep a set of small patches against the mainline kernel. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Thu Feb 23 09:11:16 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 23 Feb 2006 09:11:16 -0800 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297357@NT-SJCA-0751.brcm.ad.broadcom.com> Alexander Nezhinsky wrote: > Caitlin Bestler wrote: > >>> So what do the iser sockets do? They look like noop stubs to me. >>> >> >> Good question. >> >> I am guessing that they are exactly noop stubs, and the real point is >> to have a socket associated with an iSER RDMA connection. >> >> My question is why? Unless the attempt is to allow upgrading an iSCSI >> stream connection to an iSER connection I don't see why an iSER RDMA >> connection is in any more of a need for having a proxy socket than >> any other RDMA connection. >> > We don't upgrade iSCSI stream connection but start with an > RDMA connection right away. > The iSER code is going to be one of open-iscsi transports, > and open-iscsi opens connections using sockets from user > space, which is only natural with tcp. > The iSER RC connection should be open from kernel, so this > special socket gives us an opportunity to do so, while > leaving intact the entire mechanism of connection > establishment and user-kernel handover. > We don't really need to implement read/write primitives > because they are initiated either from within kernel > transport module itself or through a special user-kernel > interface bypassing the socket. > >> I really don't see the benefit of having a "socket" that is not truly >> integrated with the host stack. What socket attributes are being >> sought? 
And how is it unique to iSER as opposed to RDMA in general? >> > Perhaps it is plausible to implement a general-purpose > stack-integrated socket giving access to IB RC connections, > if this is what you mean. > But this was clearly out of scope for the iSER initiator. > We sought a solution to the immediate problem in open-iscsi. > So the main socket feature used here was a neat way to > delegate the IB connection establishment from user space to kernel. I'm more questioning what is the purpose of opening a "socket" from user-space when the only thing you can do with it is hand it off to a specialized kernel component. There really isn't a lot of user-space logic being preserved here unless you keep the negotiations in user-space as well. A socket handle that can only be used to hand off a connection doesn't seem to accomplish that. It also raises a major problem for iSER/IP, in that there would be a reasonable expectation that the actual TCP state was being transferred with the socket. The consensus within netdev is clearly against allowing dependencies on TCP state internals. I happen to disagree with that consensus, but I still think we should make a good faith effort to comply with it. I think a CMA style interface is more what is called for here. CMA hides the pre-RDMA communications, and provides a one-shot message exchange service to enable configuring RDMA. An equivalent service for iSCSI/iSER would establish an "iSCSI/iSER" connection to the target, and then enable exchange of "startup phase" messages before transitioning to either iSCSI full featured phase, iSER full featured phase, or terminating the connection. Using iscsi_tcp the "startup phase" messages would simply be translated into writes/reads over a TCP socket. For iSER/IB they would be translated to Send Messages. For iSER/IP they could be translated into TCP segments. Exposing a "socket" that has nothing resembling the functionality of a socket doesn't strike me as a good solution. 
An approach that is more consistent with iWARP connection setup, as I outlined above, would seem to make more sense. Why isn't a CMA style interface more applicable here? From caitlinb at broadcom.com Thu Feb 23 09:14:36 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 23 Feb 2006 09:14:36 -0800 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297358@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Christoph Hellwig wrote: >>> + note that data is never moved on the socket via send/recv but >>> only by calls from iscsi_iser.c to >>> iser_send_control/command/dataout >>> + data originating/resulting in user space (eg login request/response) >>> is moved down/up by open iscsi using netlink > >> So what do the iser sockets do? They look like noop stubs to me. > > the iser sock sole real action is connect, see iser_sock_connect > > Open iscsi expects the transport to support the following chain: > > +1 create socket > +2 connect the socket to the target > +3 bind the socket to an iscsi connection enabling actual > communication over this iscsi connection > > so, we implemented the minimal set of operations (ie socket/connect) > which will enable open iscsi to work transparently with the > iser transport. > > Or. As indicated in my other reply, this is not consistent with how iWARP connections are being done. The CMA infrastructure, with IP semantics for IB addressing, should be built upon here. There is no reason why iSER connections should follow different logic than RDMA connections.
From bos at pathscale.com Thu Feb 23 09:14:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 23 Feb 2006 09:14:48 -0800 Subject: [openib-general] OpenIb 1.0 release components In-Reply-To: <20060223170325.GB19426@mellanox.co.il> References: <20060223170325.GB19426@mellanox.co.il> Message-ID: <1140714888.17258.60.camel@serpentine.pathscale.com> On Thu, 2006-02-23 at 19:03 +0200, Michael S. Tsirkin wrote: > It seems that the openib release 1.0 as planned will include not only userspace > libraries but also some kernel level modules. Yes, I expect so. > While this might be a good idea for modules such as iSER > which are not currently part of the mainline kernel tree, > it is in my opinion clearly not a good idea to replace the > modules which *are* distributed with the mainline kernel. I agree, for the most part. What I have in mind for non-upstream kernel support is this: * We have to ship out-of-tree drivers, simply because there's only one driver in the upstream kernel, and the others are not yet ready for submission. * Some kernel components are clearly not contenders for shipping. One example is kdapl, because it appears to be dead due to upstream veto. * Others might be reasonable, if they (a) see some testing and (b) don't intrusively patch the core kernel. I'm thinking here about iSER and, to a lesser extent, SDP. The problem with SDP in particular is that we need the socket family to be present in the upstream kernel, or we can't offer a stable ABI. But SDP seems to be quite flaky, so it's not obviously a candidate for pushing to the upstream kernel as it stands. References: <200602231035.00492.jackm@mellanox.co.il> Message-ID: <43FDF024.6080608@ichips.intel.com> Jack Morgenstein wrote: > It is possible for the SA to receive Get-Table requests from different hosts, > (for different attributes, even) where the TID is identical between hosts. 
> Currently, in RMPP, the TID alone is used to match (on the responder side) > the RMPP session with the received ACKS/NACKS (which do not have the response > bit set). In such a case, all the received ACKs/NACKS, from all hosts which > are using that TID in simultaneous sessions would be delivered to the > earliest RMPP response session. I agree - matching non-data RMPP MADs needs to involve more than just the TID. - Sean From mshefty at ichips.intel.com Thu Feb 23 09:29:42 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 23 Feb 2006 09:29:42 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <200602231825.19333.jackm@mellanox.co.il> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> Message-ID: <43FDF106.8070403@ichips.intel.com> Jack Morgenstein wrote: > Regarding RMPP abort processing, I see that there is a problem: the code > assumes that all aborts are received by the responder: Correct - unfortunately, I don't believe that there's any way to know if an abort is for an RMPP message that is being sent, versus received. I.e. host A can send an RMPP message to host B with TID=3 at the same time host B sends an RMPP message to host A with TID=3. If host B sends an abort, host A has no idea which transaction is being aborted. - Sean From mkohari at novell.com Thu Feb 23 09:35:40 2006 From: mkohari at novell.com (Moiz Kohari) Date: Thu, 23 Feb 2006 10:35:40 -0700 Subject: FW: [openib-general] Towards a 1.0 release of OpenIB In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B2A8@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B2A8@mtlexch01.mtl.com> Message-ID: <43FD8FF8.F35E.006C.0@novell.com> We will miss the SLES10 GA since it is May 06. I think it is critical to get a beta snapshot of 1.0 release so we can make sure that we get this ready for SLES10 SP1 which is scheduled for later this year. 
Regards, Moiz *---------------------- Sent: Thursday, February 23, 2006 6:25 PM To: Gali Zisman Cc: openib-general Subject: RE: [openib-general] Towards a 1.0 release of OpenIB On Thu, 2006-02-23 at 16:58 +0200, Gali Zisman wrote: > I am a little concerned about the release timeline. > It looks like the GA date is May 08. If I remember correctly the SLES 10 > release date is before that. And it looks a little late for the RH > release plan as well. I understand this concern. There's a natural tension between wanting the distro vendors to be able to package and ship something and having that something be reasonably functional and somewhat tested. > We need to make sure that the distributions could use this release. Doug, do you have any comments in this regard? > Of course I can not comment on their behalf but it is crucial that we > get their approval to the schedule. I don't know who's even paying attention to any of this at Novell/SUSE, I'm afraid. I'd certainly welcome their input. References: <6AB138A2AB8C8E4A98B9C0C3D52670E301075D8C@mtlexch01.mtl.com> <43FD9D2C.2070308@voltaire.com> Message-ID: <20060223173938.GA5332@cse.ohio-state.edu> * On Feb,43 Or Gerlitz wrote : > Gil Bloch wrote: > >I believe we should add support for a resize WQ command (as a part of > >modify QP) to enable changing the WQ size. > >On a very large scale cluster, with many operating QPs, the work queue > >memory consumption might be expensive. Thus the MPI implementation > >should tradeoff for pipelining requests vs. WQ memory consumption. The > >resize WQ will allow on-demand adaptive WQ setting instead of static > >allocation of the memory resource, which I believe can increase > >performance and save memory at the same time. > > As others pointed, i think such changes should be driven by the > combination of > a) ---real--- need of an app > b) support of as much HW vendors as possible > > Did any IB MPI group approached this list expressing a need for this > feature?
I had previously posted something along these lines. You can see http://openib.org/pipermail/openib-general/2005-December/014655.html > > Moreover, here are some more points that might suggest this feature is > not needed: > > - why does the size of the cluster relates to how many credits rank A in > an mpi job would have on the connection (QP) with rank B? For a given number of credits per connection, increasing the number of connections linearly increases the memory requirement. The resize QP verb would also allow for resizing the number of max. send work requests (thus the memory required per connection). It will have impact both on small & large scale clusters. > > - who said that (say) eight credits per connection are not excellent for > an IB MPI? The optimal number of credits per connection is highly dependent on the kind of MPI application being used. I don't think any one number is the optimal for all applications. > - some MPIs (eg open MPI open connections by demand so only if rank A > attempts to send something to rank B then A connects to B I think resizing QPs is an orthogonal issue to the on-demand connection model. Even with this model, applications may end up with a large (if not fully connected) number of QPs. IMHO, this re-sizing QPs will be beneficial for such scenarios to reduce memory requirement further. > - what you really might want to resize is an SRQ in case the > implementations want to open connections by demand and add more and more > QPs to the SRQ, and the work on modify SRQ verb was initiated by Mellanox? I completely agree ... resizing SRQ would also be helpful. Thanks, Sayantan. > > - same for resize CQ - the verb exist and also the need for it. > > Or. 
> > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general -- http://www.cse.ohio-state.edu/~surs From mshefty at ichips.intel.com Thu Feb 23 09:53:10 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 23 Feb 2006 09:53:10 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: References: Message-ID: <43FDF686.1010504@ichips.intel.com> Sean Hefty wrote: > I still need to consider this in more detail to see if there isn't some simpler > solution that we're overlooking. I was thinking about this more, and I'd like to make sure that we identify all separate issues. Ignoring RMPP completely, I'm not convinced that we need to take any action. A request will either generate a response or not. If no response is generated, checking that a send is a duplicate provides very little protection. There's only a small window on the send side that such a check would even work, and the receiving side still needs to handle this. And if a response were generated, I don't see that there's any real issue. I'd like to get some agreement on whether we really need to take any action for non-RMPP MADs, then consider what issues RMPP adds. - Sean From monil at voltaire.com Thu Feb 23 10:07:14 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 23 Feb 2006 20:07:14 +0200 Subject: [openib-general] OpenIb 1.0 release components In-Reply-To: <1140714888.17258.60.camel@serpentine.pathscale.com> References: <20060223170325.GB19426@mellanox.co.il> <1140714888.17258.60.camel@serpentine.pathscale.com> Message-ID: <6a122cc00602231007y508c6c1flac8e451de11344a0@mail.gmail.com> On 2/23/06, Bryan O'Sullivan wrote: > On Thu, 2006-02-23 at 19:03 +0200, Michael S. 
Tsirkin wrote: > > > It seems that the openib release 1.0 as planned will include not only userspace > > libraries but also some kernel level modules. > > Yes, I expect so. > > > While this might be a good idea for modules such as iSER > > which are not currently part of the mainline kernel tree, > > it is in my opinion clearly not a good idea to replace the > > modules which *are* distributed with the mainline kernel. > > I agree, for the most part. > > What I have in mind for non-upstream kernel support is this: > > * We have to ship out-of-tree drivers, simply because there's only > one driver in the upstream kernel, and the others are not yet > ready for submission. > * Some kernel components are clearly not contenders for shipping. > One example is kdapl, because it appears to be dead due to > upstream veto. > * Others might be reasonable, if they (a) see some testing and (b) > don't intrusively patch the core kernel. I'm thinking here > about iSER and, to a lesser extent, SDP. I would like to add another point also. It looks like that in this round of the major distribution releases they will just not be able to include the 1.0 release due to time constraints, so the only way to use 1.0 release (or newer) will be to replace them in the kernel. Moni > > The problem with SDP in particular is that we need the socket family to > be present in the upstream kernel, or we can't offer a stable ABI. But > SDP seems to be quite flaky, so it's not obviously a candidate for > pushing to the upstream kernel as it stands. 
> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Thu Feb 23 10:15:15 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 23 Feb 2006 12:15:15 -0600 Subject: [openib-general] [PATCH] Header file Changes for iWARP Support Message-ID: <1140718515.22707.91.camel@trinity.ogc.int> This patch covers the header file changes necessary to support iWARP: The addr.h change specifies an inline function for extracting an iWARP "gid" from a dev_addr structure. The ib_device.h change adds: a reference to an iw_cm_verbs provider interface for iWARP devices, and a few device capability flags. The iw_cm.h file specifies the interface to the iWARP CM. Signed-off-by: Tom Tucker Index: rdma/ib_verbs.h =================================================================== --- rdma/ib_verbs.h (revision 5460) +++ rdma/ib_verbs.h (working copy) @@ -101,6 +101,9 @@ IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), IB_DEVICE_SRQ_RESIZE = (1<<13), IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_ZERO_STAG = (1<<15), + IB_DEVICE_MEM_WINDOW = (1<<16), + IB_DEVICE_SEND_W_INV = (1<<17), }; enum ib_atomic_cap { @@ -824,6 +827,7 @@ struct ib_gid_cache **gid_cache; }; +struct iw_cm_verbs; struct ib_device { struct device *dma_device; @@ -840,6 +844,8 @@ u32 flags; + struct iw_cm_verbs* iwcm; + int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); int (*query_port)(struct ib_device *device, Index: rdma/iw_cm.h =================================================================== --- rdma/iw_cm.h (revision 0) +++ rdma/iw_cm.h (revision 0) @@ -0,0 +1,152 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; +struct iw_cm_event; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, + IW_CM_EVENT_LLP_DISCONNECT, + IW_CM_EVENT_LLP_RESET, + IW_CM_EVENT_LLP_TIMEOUT, + IW_CM_EVENT_CLOSE +}; + +struct iw_cm_event { + enum iw_cm_event_type event; + int status; + u32 provider_id; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; +}; + +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_ESTABLISHED, /* established */ +}; + +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, + struct iw_cm_event* event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* context to provide to client cb */ + enum iw_cm_state state; + struct ib_device *device; + struct ib_qp *qp; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + u64 provider_id; /* device handle for this conn. */ + iw_event_handler event_handler; /* callback for IW CM Provider events */ +}; + +/** + * iw_create_cm_id - Allocate a communication identifier. + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the communication + * identifier. + * + * Communication identifiers are used to track connection states, + * addr resolution requests, and listen requests. 
+ */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ + +struct iw_cm_verbs { + int (*connect)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*disconnect)(struct iw_cm_id* cm_id, + int abrupt); + + int (*accept)(struct iw_cm_id*, + const void *private_data, + u8 pdata_data_len); + + int (*reject)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*getpeername)(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr); + + int (*create_listen)(struct iw_cm_id* cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id* cm_id); + +}; + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); +void iw_destroy_cm_id(struct iw_cm_id *cm_id); +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_add, + struct sockaddr_in* remote_addr); +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len); +int iw_cm_disconnect(struct iw_cm_id *cm_id); +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); + +#endif /* IW_CM_H */ Index: rdma/ib_addr.h =================================================================== --- rdma/ib_addr.h (revision 5460) +++ rdma/ib_addr.h (working copy) @@ -113,5 +113,15 @@ memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); } +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid*)rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union 
ib_gid*)rda->dst_dev_addr; +} + #endif /* IB_ADDR_H */ From caitlinb at broadcom.com Thu Feb 23 10:27:03 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 23 Feb 2006 10:27:03 -0800 Subject: [openib-general] [PATCH] Header file Changes for iWARP Support Message-ID: <54AD0F12E08D1541B826BE97C98F99F129736F@NT-SJCA-0751.brcm.ad.broadcom.com> > + > +struct iw_cm_verbs { > + int (*connect)(struct iw_cm_id* cm_id, > + const void* private_data, > + u8 private_data_len); Sorry for not catching this earlier. But IETF MPA and RDDP/SCTP both allow up to 512 bytes of private data, so "u8" isn't large enough for the iWarp specific interface. From rdreier at cisco.com Thu Feb 23 10:30:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 10:30:01 -0800 Subject: [openib-general] debian package version check issues In-Reply-To: <20060222001415.GA1357@minbar.scl.ameslab.gov> (Troy Benjegerdes's message of "Tue, 21 Feb 2006 18:14:15 -0600") References: <20060222001415.GA1357@minbar.scl.ameslab.gov> Message-ID: Troy> There's a few bogons in the libmthca version checks.. Thanks, I checked in a fix for this (libmthca now depends on libibverbs-dev >= 1.0). - R. From caitlinb at broadcom.com Thu Feb 23 10:31:47 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 23 Feb 2006 10:31:47 -0800 Subject: [openib-general] Plans for libibverbs 1.0, 1.1 and beyond Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297370@NT-SJCA-0751.brcm.ad.broadcom.com> Roland Dreier wrote: > Caitlin> This is an API question, not an implementation > Caitlin> question. > > No, it's not. Unless I'm missing something, the API of > libibverbs already has everything needed to support resizing QPs > through the ibv_modify_qp() call. > > - R. It has the syntax, but not enough semantics to properly guide its usage. If there were enough information you would not need to see specific implementations to decide whether or not to use it. 
With the syntax alone, a device implementer does not know if a resize is an indication that it MAY reduce resource allocations or that it MUST reduce resource usage. And the application developer does not know which message is being conveyed either, or whether any assistance with enforced reduction of resource usage is supported. From tom at opengridcomputing.com Thu Feb 23 10:55:20 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 23 Feb 2006 12:55:20 -0600 Subject: [openib-general] [PATCH] Header file Changes for iWARP Support In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F129736F@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F129736F@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1140720920.22707.96.camel@trinity.ogc.int> On Thu, 2006-02-23 at 10:27 -0800, Caitlin Bestler wrote: > > + > > +struct iw_cm_verbs { > > + int (*connect)(struct iw_cm_id* cm_id, > > + const void* private_data, > > + u8 private_data_len); > > Sorry for not catching this earlier. But IETF MPA and RDDP/SCTP > both allow up to 512 bytes of private data, so "u8" isn't large > enough for the iWarp specific interface. Good point. This interface can be anything, but as it goes up the stack it hits the conn_param data structure. Is that a big deal? I guess the provider could fail it if the supplied private_data_len is too big. 
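To make the size concern in this thread concrete: a u8 length field tops out at 255, so it cannot describe the 512 bytes of private data that MPA and RDDP/SCTP are said to allow above. The following is only a sketch under stated assumptions. The names example_fits_in_u8(), example_check_private_data() and EXAMPLE_MAX_PRIVATE_DATA are hypothetical and are not part of the posted patch or of any provider; the sketch assumes a 16-bit length so that 512 is representable, and it shows one way a provider might reject an oversized buffer, as suggested at the end of the message above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the 512-byte figure quoted in the thread. */
#define EXAMPLE_MAX_PRIVATE_DATA 512

/* Returns 1 if a length can be stored in a u8 field, 0 otherwise. */
int example_fits_in_u8(size_t len)
{
	return len <= UINT8_MAX;
}

/*
 * Hypothetical provider-side check (not from the patch): reject a
 * request whose private data is missing or larger than the provider
 * supports. The length is 16 bits wide here so that 512 fits.
 */
int example_check_private_data(const void *private_data,
			       uint16_t private_data_len,
			       uint16_t provider_max)
{
	if (private_data_len > 0 && private_data == NULL)
		return -EINVAL;
	if (private_data_len > provider_max)
		return -EINVAL;
	return 0;
}
```

With a u8 field, a caller could not even express a 512-byte request, which is why the check above takes a wider type. Whether the limit should live in the provider or in a shared structure such as conn_param is the open question in the thread; the sketch takes no position on that.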
From alexn at voltaire.com Thu Feb 23 11:09:14 2006 From: alexn at voltaire.com (Alexander Nezhinsky) Date: Thu, 23 Feb 2006 21:09:14 +0200 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1297358@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1297358@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <43FE085A.6000005@voltaire.com> >> >>Open iscsi expects the transport to support the following chain: >> >>+1 create socket >>+2 connect the socket to the target >>+3 bind the socket to an iscsi connection enabling actual >>communication over this iscsi connection >> >>so, we implemented the minimal set of operations (ie socket/connect) >>which will enable open iscsi to work transparently with the >>iser transport. >> >>Or. >> > >As indicated in my other reply, this is not consistent >with how iWARP connections are being done. The CMA >infrastructure, with IP semantics for IB addressing, >should be built upon here. There is no reason why >iSER connections should follow different logic than >RDMA connections. > This one is from your other reply: >There really isn't a lot of user-space logic being preserved >here unless you keep the negotiations in user-space as well. > The point is that open-iscsi *keeps* all negotiations in user-space. As far as i understand, it's not going to change because the separation of control and data paths is one of the basic design principles of the project. And iSER/IB connections do differ from other types. Again, citing your other reply: >Using iscsi_tcp the "startup phase" messages would simply >be translated into writes/reads over a TCP socket. For >iSER/IB they would be translated to Send Messages. For >iSER/IP they could be translated into TCP segments. > The differences that you mentioned correspond to 3 ways that various transports (iscsi/ip, iser/ib, iser/ip) would handle the reads/writes. 
This presents no problem, because there would be 3 different transport modules, each one acting appropriately in response to the "messages". The only problem that remains is connection establishment. Because the differences there are just as pronounced, and the connections are initiated from user space using a socket, we don't have another solution which looks cleaner. From michaelc at cs.wisc.edu Thu Feb 23 11:12:24 2006 From: michaelc at cs.wisc.edu (Mike Christie) Date: Thu, 23 Feb 2006 13:12:24 -0600 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <43FD8353.3020909@voltaire.com> References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com> Message-ID: <43FE0918.9010304@cs.wisc.edu> Or Gerlitz wrote: > Christoph Hellwig wrote: > > >>>+static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, >>>+ struct iser_data_buf *p_data, > > ... > >>>+ if (p_data->type == ISER_BUF_TYPE_SINGLE) { >>>+ p_iser_task->data_len[iser_dir] = p_data->size; >>>+ dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size, > > >>I'd say kill the non-SG case. We're in the progress of removing non-SG >>commands in the scsi midlayer, and I'm pretty sure they won't exist >>anymore before the iser code merged. > > > Indeed i see that driver/scsi/iscsi_tcp.c of 2.6.16-rc1 (and on) does > not support non-SG SCSI commands. We should still support them in 2.6.16. We have not cleaned out that code yet. We are doing the header fixup at the same time. That patch to use change when we use sendmsg/sendpage should not have broken the non-sg case. Did it? Can you confirm that as of 2.6.16 a > SCSI LLD does not need to support the non-SG case? For new software iscsi I do not think there is no need to support non-sg commands. There is only one deprecated ioctl through which we could ever get a non-sg command (I do not think someone is using iscsi and osst) and that can be converted pretty easily. 
I had sent a patch long ago but it did not get merged and I did not much care about it since nobody should be using it with the newer iscsi code. > > OK - the code for upstream will assume only SG commands. > > Now, regardless of when iSER is to be merged, it is used with 2.6.15 and > older kernels that do issue non-SG commands. I wonder what would be the > simplest patch to support it, does it make sense to use virt_to_page on > sc->request_buffer to compose one entry SG on the fly and use it down > the code? > For upstream code we are going to want something that is clean. By the time iscsi_iser goes in, testing for "if (!sc->use_sg)" will be one of those things reviewers ask you to remove because it cannot happen in the upstream kernel. Plus, it seems like there is some duplication between the iscsi_tcp and iscsi_iser modules for this type of operation. That is just going by when I last looked at your code. I did not see anything posted on open-iscsi or linux-scsi yet so I do not know what this patchset looks like. But it has been stated before that we might have to break up some of the core iscsi code to be shared with iscsi_iser and iscsi_tcp. Managing mempools of iscsi tasks and translating scsi_cmnd scatterlists to iscsi specifics seems like a candidate. I did not see your current code though so I am not 100% sure. From swise at opengridcomputing.com Thu Feb 23 11:14:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 23 Feb 2006 13:14:19 -0600 Subject: [openib-general] fmr question Message-ID: <1140722059.17964.31.camel@stevo-desktop> I've perused the previous discussions on FMRs and I'm still not clear on the following: Let's say an FMR is allocated with a map count of 4. And the FMR is mapped 4 times, then unmapped. Are the underlying MRs still valid and set up for RDMA from the HW perspective? Or are they marked "INVALID" as part of unmapping? 
I.e., do they remain valid up to unmap or up to dealloc of the FMR? From the mthca code, all I see is that MTHCA_MPT_STATUS_HW is set while anything is mapped, and MTHCA_MPT_STATUS_SW is set when the unmap is done. I don't know mthca well enough to know whether this is somehow invalidating the MRs in question. Thanks, Steve. From michaelc at cs.wisc.edu Thu Feb 23 11:17:40 2006 From: michaelc at cs.wisc.edu (Mike Christie) Date: Thu, 23 Feb 2006 13:17:40 -0600 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <43FE0918.9010304@cs.wisc.edu> References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com> <43FE0918.9010304@cs.wisc.edu> Message-ID: <43FE0A54.8050308@cs.wisc.edu> Mike Christie wrote: > Or Gerlitz wrote: > >>Christoph Hellwig wrote: >> >> >> >>>>+static int iser_dma_map_task_data(struct iscsi_iser_cmd_task *p_iser_task, >>>>+ struct iser_data_buf *p_data, >> >>... >> >> >>>>+ if (p_data->type == ISER_BUF_TYPE_SINGLE) { >>>>+ p_iser_task->data_len[iser_dir] = p_data->size; >>>>+ dma_addr = dma_map_single(dma_device,p_data->p_buf, p_data->size, >> >> >>>I'd say kill the non-SG case. We're in the progress of removing non-SG >>>commands in the scsi midlayer, and I'm pretty sure they won't exist >>>anymore before the iser code merged. >> >> >>Indeed i see that driver/scsi/iscsi_tcp.c of 2.6.16-rc1 (and on) does >>not support non-SG SCSI commands. > > > We should still support them in 2.6.16. We have not cleaned out that > code yet. We are doing the header fixup at the same time. That patch to > use change when we use sendmsg/sendpage should not have broken the > non-sg case. Did it? > > Can you confirm that as of 2.6.16 a > >>SCSI LLD does not need to support the non-SG case? 
> > > For new software iscsi I do not think there is no need to support non-sg replace "no" with "a" :) From bos at pathscale.com Thu Feb 23 11:31:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 23 Feb 2006 11:31:37 -0800 Subject: FW: [openib-general] Towards a 1.0 release of OpenIB In-Reply-To: <43FD8FF8.F35E.006C.0@novell.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30100B2A8@mtlexch01.mtl.com> <43FD8FF8.F35E.006C.0@novell.com> Message-ID: <1140723097.17258.67.camel@serpentine.pathscale.com> On Thu, 2006-02-23 at 10:35 -0700, Moiz Kohari wrote: > We will miss the SLES10 GA since it is May 06. OK. > I think it is critical to get a beta snapshot of 1.0 release so we can make sure that we get this ready for SLES10 SP1 which is scheduled for later this year. We should have an rc1 release within a few weeks. SUBJECT LINE: IDF special rates for OpenIB members Dear OpenIB Members, *Intel Developer Forum (IDF) * is almost here. As a member of OpenIB you'll pay only *$700* for a full conference pass at the Intel(r) Developer Forum (IDF). Intel is also offering a one day pass for *$295* to IDF for all OpenIB members. ** To register for IDF, please visit * https://www21.cplan.com/pls/pg_intel/c125_reg_entry.idfa_spring_sys* * * For a full conference pass, type in OpenIB Priority Code: *TWNOIB* For a one day pass to IDF type in OpenIB Priority Code: *ODROIB*** * * *Open Fabrics Symposium on March 6* The OpenIB Alliance is teaming up with IBTA to offer an Open Fabrics Symposium on Monday, March 6 at Moscone, the day before IDF starts. Register by Feb 28 for only $175 at http://www.acteva.com/booking.cfm?bevaid=104352. Agenda available online. *Who You Know Counts. Grow Your Personal Community. * At IDF, *finding peers* to exchange ideas with *is easy* – *join the* *new* *online Personal Community *. Use it to search for and contact fellow IDF attendees who have joined the online Personal Community. 
Participating attendees are able to read your information and contact you. Meet peers from across the industry. Brainstorm. Collaborate. *Change technology together*. How will the technologists you meet improve the way you work? Find out.* **Log on now * to *establish your personal profile*. *Access Your Future at the IDF Technology Showcase. * * * There's a reason the *IDF Technology Showcase * is open daily. With more than *170 top companies*, including Intel demonstrating their *latest developments*, hundreds of engineers to meet and discuss technology with, and the *strongest gathering of Technology Communities* around, all in one place, you need three days to do it all. So come rested and ready for a *technology onslaught* filled with *new ideas*, and *new engineers* to collaborate with that *enhance* *your* current and future *projects*. *More Ways to Collaborate with Intel Experts. * Attend the *Think Tank Networking Exchange *. It's the place for you to begin a technology discussion with *Intel technologists*. Exchange ideas, *discuss technical issues*, and *start a collaboration* that could continue long after IDF is over. The Think Tank Networking Exchange happens during lunch on Thursday. So grab a bite, come grab a seat, and *let the problem solving begin*. For information on the 170 hours of technical training you can participate in at IDF, visit the *Content Catalog*. We'll see you at IDF, Your IDF Team -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Thu Feb 23 12:14:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 12:14:23 -0800 Subject: [openib-general] Re: mthca fix: update the init attributes in create_srq In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4EB8@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 20 Feb 2006 11:52:07 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3016C4EB8@mtlexch01.mtl.com> Message-ID: Thanks, applied. 
From rdreier at cisco.com Thu Feb 23 12:34:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 12:34:34 -0800 Subject: [openib-general] Re: 1/2 libibverbs: update init attribute in create_srq In-Reply-To: <43FC8852.8020304@mellanox.co.il> (Dotan Barak's message of "Wed, 22 Feb 2006 17:50:42 +0200") References: <43FC8852.8020304@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Feb 23 12:36:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 12:36:31 -0800 Subject: [openib-general] Re: 2/2 core: update init attribute in create_srq In-Reply-To: <43FC88AF.4080503@mellanox.co.il> (Dotan Barak's message of "Wed, 22 Feb 2006 17:52:15 +0200") References: <43FC88AF.4080503@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Thu Feb 23 12:39:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 12:39:54 -0800 Subject: [openib-general] fmr question In-Reply-To: <1140722059.17964.31.camel@stevo-desktop> (Steve Wise's message of "Thu, 23 Feb 2006 13:14:19 -0600") References: <1140722059.17964.31.camel@stevo-desktop> Message-ID: Steve> Lets say a FMR is allocated with a map count of 4. And the Steve> FMR is mapped 4 times, then unmapped. Are the underlying Steve> MRs still valid and setup for RDMA from the HW perspective? Steve> Or are they marked "INVALID" as part of unmapping? IE: Do Steve> they remain valid up to unmap or up to dealloc of the FMR They are not valid after the unmap operation. In fact, old mappings are probably not valid after the FMR is remapped; the issue for Mellanox hardware is that the old mapping might still be hanging around in an internal cache somewhere. However the unmap_fmr operation flushes all these caches, so all old mappings are gone after an FMR is unmapped. - R. 
From rdreier at cisco.com Thu Feb 23 12:43:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 12:43:09 -0800 Subject: [openib-general] Re: ipoib_multicast_ah.patch In-Reply-To: <20060222104037.GB21077@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 22 Feb 2006 12:40:38 +0200") References: <20060222104037.GB21077@mellanox.co.il> Message-ID: This looks OK, I guess, but... Mst> ipoib_mcast_send tests mcast->ah twice. If this value Mst> is changed between these two points, we leak an skb. Who else might change the value here? Answering this would help justify the patch and save me from having to write up a changelog entry. - R. From swise at opengridcomputing.com Thu Feb 23 12:46:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 23 Feb 2006 14:46:53 -0600 Subject: [openib-general] fmr question In-Reply-To: References: <1140722059.17964.31.camel@stevo-desktop> Message-ID: <1140727613.17964.46.camel@stevo-desktop> So this implies that there is really only one mapping outstanding at any point in time (less the cache issue). Right? So why is there a map count as an fmr attribute? It seems like just an arbitrary limit put on how many times you can map an fmr before unmapping. And once you unmap you can start mapping again up to the map count... ?? On Thu, 2006-02-23 at 12:39 -0800, Roland Dreier wrote: > Steve> Lets say a FMR is allocated with a map count of 4. And the > Steve> FMR is mapped 4 times, then unmapped. Are the underlying > Steve> MRs still valid and setup for RDMA from the HW perspective? > Steve> Or are they marked "INVALID" as part of unmapping? IE: Do > Steve> they remain valid up to unmap or up to dealloc of the FMR > > They are not valid after the unmap operation. In fact, old mappings > are probably not valid after the FMR is remapped; the issue for > Mellanox hardware is that the old mapping might still be hanging > around in an internal cache somewhere. 
However the unmap_fmr > operation flushes all these caches, so all old mappings are gone after > an FMR is unmapped. > > - R. From sean.hefty at intel.com Thu Feb 23 14:44:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 23 Feb 2006 14:44:47 -0800 Subject: [openib-general] user_mad questions Message-ID: I ran into a couple of questions while updating the user_mad code that I'd like to clarify. First, the only failed sends that are reported are response timeouts. Is there a reason that other failures are not reported? (I can understand treating the MADs as unreliable, but just wanted to check that this was the intent.) Second, when a failed send is reported, there's a mismatch between the size of the data copied to the user and the size returned from ib_umad_read(). The copied data is sizeof(ib_user_mad) + sizeof(ib_mad). The reported size turns out to be sizeof(ib_user_mad) + sizeof(ib_mad_hdr). It appears that the data copy is off. Finally, when reporting that the buffer for a received MAD is too small, what information actually needs to make it back to the user? The code currently copies the entire first segment. What would happen if only the common, RMPP, and class specific headers were returned to the user? - Sean From rdreier at cisco.com Thu Feb 23 15:04:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 15:04:00 -0800 Subject: [openib-general] user_mad questions In-Reply-To: (Sean Hefty's message of "Thu, 23 Feb 2006 14:44:47 -0800") References: Message-ID: Sean> First, the only failed sends that are reported are response Sean> timeouts. Is there a reason that other failures are not Sean> reported? (I can understand treating the MADs as Sean> unreliable, but just wanted to check that this was the Sean> intent.) No reason really except that all other errors "should never happen" and therefore there's not much that a consumer can do with the error. 
Sean> Second, when a failed send is reported, there's a mismatch Sean> between the size of the data copied to the user and the size Sean> returned from ib_umad_read(). The copied data is Sean> sizeof(ib_user_mad) + sizeof(ib_mad). The reported size Sean> turns out to be sizeof(ib_user_mad) + sizeof(ib_mad_hdr). Sean> It appears that the data copy is off. Yes, that seems to be a long-standing bug. Not sure when that was introduced. - R. From rdreier at cisco.com Thu Feb 23 15:06:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Feb 2006 15:06:52 -0800 Subject: [openib-general] fmr question In-Reply-To: <1140727613.17964.46.camel@stevo-desktop> (Steve Wise's message of "Thu, 23 Feb 2006 14:46:53 -0600") References: <1140722059.17964.31.camel@stevo-desktop> <1140727613.17964.46.camel@stevo-desktop> Message-ID: Steve> So this implies that there is really only one mapping Steve> outstanding at any point in time (less the cache issue). Steve> Right? So why is there a map count as an fmr attribute? Steve> Its seems like just an arbitrary limit put on how many Steve> times you can map an fmr before unmaping. -and- once you Steve> unmap you can start mapping again up to the map count... ?? Part of the L_Key and R_Key are twiddled each time the FMR is remapped. Because of the possibility of stale data hanging around the cache, we don't want to reuse an old key without flushing any translation caches. So when we've used all the possibilities for the changeable part of the key, we need to make sure the FMR is unmapped (and the cache flushed) before remapping it. The whole FMR scheme was driven by Mellanox hardware, so if there's a way we can change things to make it more generic, I'm all for it. - R. 
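The key-reuse rule described above can be modeled in a few lines of C. This is a toy sketch, not mthca code: struct example_fmr, example_fmr_map() and example_fmr_unmap() are hypothetical names, the key layout (fixed base plus a small counter) is an assumption made for illustration, and the cache flush is only indicated by a comment. It shows why a map count exists: each remap consumes one value of the changeable part of the key, and once those values are used up the FMR has to be unmapped before it can be mapped again.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of an FMR; field names are illustrative only. */
struct example_fmr {
	uint32_t base_key;	/* fixed part of the L_Key/R_Key */
	uint32_t max_maps;	/* remaps allowed between unmaps */
	uint32_t maps;		/* remaps done since the last unmap */
};

/*
 * Map the FMR again and hand back a key that has not been used since
 * the last unmap. Returns -EINVAL once the changeable part of the key
 * is exhausted, in which case the caller must unmap first.
 */
int example_fmr_map(struct example_fmr *fmr, uint32_t *key)
{
	if (fmr->maps >= fmr->max_maps)
		return -EINVAL;
	fmr->maps++;
	*key = fmr->base_key + fmr->maps;
	return 0;
}

/*
 * Unmap the FMR. Real hardware would flush its translation caches at
 * this point; only after that is it safe to hand out old keys again,
 * so the counter is reset here.
 */
void example_fmr_unmap(struct example_fmr *fmr)
{
	fmr->maps = 0;
}
```

With max_maps set to 4, as in the question that started the thread, four calls to example_fmr_map() succeed with four distinct keys, a fifth fails, and after example_fmr_unmap() the first key value becomes available again.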
From mshefty at ichips.intel.com Thu Feb 23 15:07:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 23 Feb 2006 15:07:38 -0800 Subject: [openib-general] user_mad questions In-Reply-To: References: Message-ID: <43FE403A.1020209@ichips.intel.com> Roland Dreier wrote: > Sean> Second, when a failed send is reported, there's a mismatch > Sean> between the size of the data copied to the user and the size > Sean> returned from ib_umad_read(). The copied data is > Sean> sizeof(ib_user_mad) + sizeof(ib_mad). The reported size > Sean> turns out to be sizeof(ib_user_mad) + sizeof(ib_mad_hdr). > Sean> It appears that the data copy is off. > > Yes, that seems to be a long-standing bug. Not sure when that was introduced. I have a patch to fix this as part of some general RMPP cleanup/optimizations. - Sean From caitlinb at broadcom.com Thu Feb 23 15:13:10 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 23 Feb 2006 15:13:10 -0800 Subject: [openib-general] fmr question Message-ID: <54AD0F12E08D1541B826BE97C98F99F12973BD@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve> So this implies that there is really only one mapping > Steve> outstanding at any point in time (less the cache issue). > Steve> Right? So why is there a map count as an fmr attribute? > Steve> Its seems like just an arbitrary limit put on how many > Steve> times you can map an fmr before unmaping. -and- once you > Steve> unmap you can start mapping again up to the map count... ?? > > Part of the L_Key and R_Key are twiddled each time the FMR is > remapped. Because of the possibility of stale data hanging > around the cache, we don't want to reuse an old key without > flushing any translation caches. So when we've used all the > possibilities for the changeable part of the key, we need to > make sure the FMR is unmapped (and the cache flushed) before > remapping it. 
> > The whole FMR scheme was driven by Mellanox hardware, so if > there's a way we can change things to make it more generic, > I'm all for it. > Both the RDMAC and InfiniBand 1.2 verbs define FMR binding and invalidation through work requests on privileged QPs. Invalidation of a specific bind is reported as a Work Completion. By definition, any translation caches MUST be cleared by the time that completion is delivered to the Consumer. Adopting those semantics, and making them available via a Work Request for devices supporting that capability, would eliminate the problem. In my opinion undefined flushing of old translation caches is a security vulnerability that makes the current FMR definition unusable in any network that was not 100% physically secured. And even then it is a weakness asking for a bug to hit it. 
From halr at voltaire.com Fri Feb 24 04:01:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 07:01:24 -0500 Subject: ***SPAM*** [openib-general] OpenSM mads timeout In-Reply-To: References: Message-ID: <1140782437.28051.40264.camel@hal.voltaire.com> On Thu, 2006-02-23 at 04:54, yipee wrote: > Hi, > > What happens when OpenSM stops receiving replies to MAD requests it sends to > hosts in the subnet? It depends on where the error occurs in OpenSM as it has different "phases". > Does it notify the infiniband switch to do something to its corresponding ports? Not sure what you mean by this. Are you referring to non responsive ports ? If so, are they links to a CA or another switch ? Can you send the OpenSM logs (and were they run with -V) ? -- Hal > What function/s in the OpenSM source does this? > > > Thanks, > y > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From takshak at gs-lab.com Fri Feb 24 05:09:51 2006 From: takshak at gs-lab.com (Takshak C.) 
Date: Fri, 24 Feb 2006 18:39:51 +0530 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <1140615978.28051.15030.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> <43FC6288.4040402@gs-lab.com> <1140615978.28051.15030.camel@hal.voltaire.com> Message-ID: <43FF059F.4000901@gs-lab.com> Hi Hal: Thanks for the information. I would like to confirm that, umad_send() and umad_recv calls goes out of osm libraries. I have written an application to get PATH_RECORDS using opensm libraries similar to osmtest. If I don't start openSM instance, then I don't get results rather I get an error message as below: [ umad_receiver: ERR 5409: send completed with error (method=0x12 attr=0x35 trans_id=0x1) -- dropping ] I have tried to go inside. Just after umad_send() call in osm_vendor_ibumad.c ( osm_vendor_send() function ), I called umad_recv() function call and tried to check the receive length. And I am getting 24. Why this could have been happened ? return length = 24 means call has not received path records. But if I start openSM instance then I get proper result length. I would like to remove dependency of starting this openSM instance and my application program should run independently. Could you please throw me some light on this. When I have called umad_recv() it should not depend on openSM instance I believe to received the things. Thanks & Regards. - Takshak Hal Rosenstock wrote: >On Wed, 2006-02-22 at 08:09, Takshak C. wrote: > >>Thanks a lot Hal, for clearing my doubts. >>I would like to redefine my problem based on your inputs. >> >>I am into a scenario, where vendor specific primary SM is running in >>the subnet. >>This running SM is different than openSM. I have loaded an openIB >>stack on the host. >> > >OK. I understand your configuration. 
> > >>Some of the sample examples from management/diags/src/ directory like >>smpquery >>for nodeinfo etc works and gives result to me. >> >>Now, could it be possible for me to write a SA query and fetch the >>path, service >>or info records >> > >Info records ? > > >> without starting openSM instance as I have already primary SM >>running in the subnet. ? >> > >Yes; all you are (conceptually) talking about is a user SA client. > > >> I believe, this question could be right and your answer would >>help me. >> >>I do not want to start openSM because then synchronization between >>primary SM >>and openSM would bring other issues or difficulties. >> > >Understood. It was unclear whether you had an SM in your subnet. > >You should be able to link libopensm and the other management libraries >to an SA application which would do this (and not require OpenSM >itself). > > >>Could you please tell me, how should I go about it ? Waiting. >> > >I think I've already answered this. > >-- Hal > > >>Regards. >>- Takshak >> >> >> >>Hal Rosenstock wrote: >> >>>On Wed, 2006-02-22 at 06:56, Takshak C. wrote: >>> >>> >>>>Hal Rosenstock wrote: >>>> >>>> >>>>>>Please throw some light on this. Do you have any userspace SA support for retrieving path, service record >>>>>>information ? >>>>>> >>>>>> >>>>>> >>>>>There have been discussions about userspace SA support but nothing >>>>>currently for OpenIB (gen2). Currently, you can get this by using >>>>> >>>>> >>>>> >>>>Could you please tell me, when userspace SA support will be available >>>>in openIB gen2. >>>> >>>> >>>I don't know but I'm not sure how much this helps you based on your >>>questions below. >>> >>> >>> >>>>>osm_vendor_ibumad_sa.c which supports most SA requests. It is built as >>>>>part of libosmvendor (part of the OpenSM build) but can be used outside >>>>>of OpenSM. It is used by osmtest if you want to look at some use cases. >>>>>It obtains PathRecords and ServiceRecords. 
That might be an easier >>>>>direction to go than trying to use the management libraries to build the >>>>>pieces of a userspace SA client you want. >>>>> >>>>>-- Hal >>>>> >>>>> >>>>> >>>>See, to execute osmtest, I found that openSM instance must be there. >>>> >>>> >>>Must be where ? What is your IB configuration ? >>> >>> >>> >>>> So, even if I use part >>>>of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I have to >>>>start openSM >>>>instance to execute the SA query successfully. >>>> >>>> >>>An SM is needed in the subnet and SA is part of that and answers such >>>queries. >>> >>> >>> >>>>Without starting openSM client, I m able to retrieve node description, >>>>node info, SM info, >>>>port info by using management libraries libibumad and libibmad. >>>> >>>> >>>of the local node only (until the SM brings up the subnet). >>> >>> >>> >>>>What I want to achieve is, without talking with openSM instance, my SA >>>>query client >>>>should go and get the required information. >>>> >>>> >>>Why ? >>> >>> >>> >>>>Is this possible ?. >>>> >>>> >>>No. What would you query for paths to if the subnet were not up ? >>> >>>-- Hal >>> >>> >>> >>>>Would like to know your inputs on this. >>>> >>>>Regards, >>>>- Takshak >>>> >>>> >>>>>>Regards. >>>>>>- Takshak >>>>>> >>>>>> >>>>>>Hal Rosenstock wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>Hi, >>>>>>> >>>>>>>There are a couple of issues with the below. >>>>>>> >>>>>>>1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. >>>>>>> >>>>>>>2. I will assume your register call sets RMPP. >>>>>>> >>>>>>>3. SA class version is 2. >>>>>>> >>>>>>>What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). 
>>>>>>> >>>>>>>-- Hal >>>>>>> >>>>>>>________________________________ >>>>>>> >>>>>>>From: openib-general-bounces at openib.org on behalf of Takshak C. >>>>>>>Sent: Mon 2/6/2006 8:00 AM >>>>>>>To: openib-general at openib.org >>>>>>>Subject: [openib-general] Get Table Records for SA Attribute ID ? >>>>>>> >>>>>>> >>>>>>> >>>>>>>Hi, >>>>>>> >>>>>>>I m trying to get the table records for SA attribute ID in following way. >>>>>>>But, I m not getting a single record, could anyone comment on the problem. >>>>>>> >>>>>>>1. I have created saMadFormat structure described in the specification as below: >>>>>>> >>>>>>>struct saMadFormat >>>>>>>{ >>>>>>> >>>>>>> uint8_t base_version ; >>>>>>> uint8_t mgmt_class ; >>>>>>> uint8_t class_version ; >>>>>>> uint8_t sa_method ; >>>>>>> uint16_t status ; >>>>>>> uint16_t not_used ; >>>>>>> uint64_t tid ; >>>>>>> uint16_t attr_id ; >>>>>>> uint16_t resv ; >>>>>>> uint32_t attr_mod ; >>>>>>> uint64_t sa_key; >>>>>>> uint64_t sm_key ; >>>>>>> uint32_t seg_num ; >>>>>>> uint32_t payload_len ; >>>>>>> uint8_t frag_flag ; >>>>>>> uint8_t edit_mod ; >>>>>>> uint16_t window ; >>>>>>> uint32_t endRID ; >>>>>>> uint64_t comp_mask ; >>>>>>> uint8_t adminData[192] ; >>>>>>>}; >>>>>>> >>>>>>>2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS >>>>>>> and umad_open_port etc successfully. >>>>>>> >>>>>>>3. struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); >>>>>>> memset(saQuery, 0, sizeof(*saQuery)); >>>>>>> >>>>>>> saQuery->base_version = 1; >>>>>>> saQuery->mgmt_class = IB_SA_CLASS ; >>>>>>> saQuery->class_version = 1 ; >>>>>>> saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; >>>>>>> saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; >>>>>>> saQuery->attr_mod = 0 ; >>>>>>> saQuery->tid = htonll(drmad_tid++); >>>>>>> saQuery->endRID = 0 ; >>>>>>> >>>>>>> umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); >>>>>>> umad_set_grh(umad, 0); >>>>>>> umad_set_pkey(umad, 0xFFFF); >>>>>>> >>>>>>>4. 
length = IB_MAD_SIZE; >>>>>>> >>>>>>> if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) >>>>>>> IBPANIC("send failed"); >>>>>>> >>>>>>> if (umad_recv(portid, umad, &length, -1) != mad_agent) >>>>>>> IBPANIC("recv error: %s", drmad_status_str(saQuery)); >>>>>>> >>>>>>> >>>>>>> >>>>>>> if (!dump_char) { >>>>>>> xdump(stdout, 0, saQuery->adminData, 192); >>>>>>> return 0; >>>>>>> } >>>>>>> >>>>>>>I m expecting that, I will get the resultant data in saQuery->adminData. >>>>>>>Is this correct ? If not then, how should I retrieve the table records ? >>>>>>>Any Idea ? >>>>>>> >>>>>>> >>>>>>>Thanks >>>>>>>- Takshak >>>>>>> >>>>>>>_______________________________________________ >>>>>>>openib-general mailing list >>>>>>>openib-general at openib.org >>>>>>>http://openib.org/mailman/listinfo/openib-general >>>>>>> >>>>>>>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From halr at voltaire.com Fri Feb 24 06:00:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 09:00:59 -0500 Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <43FF059F.4000901@gs-lab.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> <43FC6288.4040402@gs-lab.com> <1140615978.28051.15030.camel@hal.voltaire.com> <43FF059F.4000901@gs-lab.com> Message-ID: <1140789577.28051.41187.camel@hal.voltaire.com> Hi Takshak, On Fri, 2006-02-24 at 08:09, Takshak C. wrote: > Hi Hal: > > Thanks for the information. > > I would like to confirm that, umad_send() and umad_recv calls goes out of osm libraries. Not sure what you mean by this.
The umad library of which those calls are a part of is underneath any OSM libraries that might be used. > I have written an application to get PATH_RECORDS using opensm libraries similar to osmtest. > If I don't start openSM instance, then I don't get results rather I get an error message as > below: > [ umad_receiver: ERR 5409: send completed with error (method=0x12 attr=0x35 trans_id=0x1) -- dropping ] That's saying that a matching receive was not seen (the response to the GetTable PathRecord request). The transaction ID looks funny to me for matching. Is this being set correctly ? (Not having the code it is hard for me to tell how things are initialized and what mode this is working in). > I have tried to go inside. Just after umad_send() call in osm_vendor_ibumad.c ( osm_vendor_send() function ), > I called umad_recv() function call and tried to check the receive length. And I am getting 24. > > Why this could have been happened ? return length = 24 means call has not received path records. Because if it is getting a response, due to the transaction ID, I do not think it is considered a match. I can't be sure with the info provided. > But if I start openSM instance then I get proper result length. Sounds like your application may have an initialization issue that this fixes. > I would like to remove dependency of starting this openSM instance and my application program > should run independently. This should be possible. > Could you please throw me some light on this. When I have called umad_recv() it should not depend > on openSM instance I believe to received the things. Correct. -- Hal > Thanks & Regards. > - Takshak > > > Hal Rosenstock wrote: > > On Wed, 2006-02-22 at 08:09, Takshak C. wrote: > > > Thanks a lot Hal, for clearing my doubts. > > > I would like to redefine my problem based on your inputs. > > > > > > I am into a scenario, where vendor specific primary SM is running in > > > the subnet. > > > This running SM is different than openSM. 
I have loaded an openIB > > > stack on the host. > > > > OK. I understand your configuration. > > > > > Some of the sample examples from management/diags/src/ directory like > > > smpquery > > > for nodeinfo etc works and gives result to me. > > > > > > Now, could it be possible for me to write a SA query and fetch the > > > path, service > > > or info records > > > > Info records ? > > > > > without starting openSM instance as I have already primary SM > > > running in the subnet. ? > > > > Yes; all you are (conceptually) talking about is a user SA client. > > > > > I believe, this question could be right and your answer would > > > help me. > > > > > > I do not want to start openSM because then synchronization between > > > primary SM > > > and openSM would bring other issues or difficulties. > > > > Understood. It was unclear whether you had an SM in your subnet. > > > > You should be able to link libopensm and the other management libraries > > to an SA application which would do this (and not require OpenSM > > itself). > > > > > Could you please tell me, how should I go about it ? Waiting. > > > > I think I've already answered this. > > > > -- Hal > > > > > Regards. > > > - Takshak > > > > > > > > > > > > Hal Rosenstock wrote: > > > > On Wed, 2006-02-22 at 06:56, Takshak C. wrote: > > > > > > > > > Hal Rosenstock wrote: > > > > > > > > > > > > Please throw some light on this. Do you have any userspace SA support for retrieving path, service record > > > > > > > information ? > > > > > > > > > > > > > > > > > > > > > > > > > > There have been discussions about userspace SA support but nothing > > > > > > currently for OpenIB (gen2). Currently, you can get this by using > > > > > > > > > > > > > > > > > > > > > > Could you please tell me, when userspace SA support will be available > > > > > in openIB gen2. > > > > > > > > > > > > > I don't know but I'm not sure how much this helps you based on your > > > > questions below. 
> > > > > > > > > > > > > > osm_vendor_ibumad_sa.c which supports most SA requests. It is built as > > > > > > part of libosmvendor (part of the OpenSM build) but can be used outside > > > > > > of OpenSM. It is used by osmtest if you want to look at some use cases. > > > > > > It obtains PathRecords and ServiceRecords. That might be an easier > > > > > > direction to go than trying to use the management libraries to build the > > > > > > pieces of a userspace SA client you want. > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > > > > > > See, to execute osmtest, I found that openSM instance must be there. > > > > > > > > > > > > > Must be where ? What is your IB configuration ? > > > > > > > > > > > > > So, even if I use part > > > > > of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I have to > > > > > start openSM > > > > > instance to execute the SA query successfully. > > > > > > > > > > > > > An SM is needed in the subnet and SA is part of that and answers such > > > > queries. > > > > > > > > > > > > > Without starting openSM client, I m able to retrieve node description, > > > > > node info, SM info, > > > > > port info by using management libraries libibumad and libibmad. > > > > > > > > > > > > > of the local node only (until the SM brings up the subnet). > > > > > > > > > > > > > What I want to achieve is, without talking with openSM instance, my SA > > > > > query client > > > > > should go and get the required information. > > > > > > > > > > > > > Why ? > > > > > > > > > > > > > Is this possible ?. > > > > > > > > > > > > > No. What would you query for paths to if the subnet were not up ? > > > > > > > > -- Hal > > > > > > > > > > > > > Would like to know your inputs on this. > > > > > > > > > > Regards, > > > > > - Takshak > > > > > > > > > > > > Regards. 
> > > > > > > - Takshak > > > > > > > > > > > > > > > > > > > > > Hal Rosenstock wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > There are a couple of issues with the below. > > > > > > > > > > > > > > > > 1. SA MAD structure is missing the RMPP header. Once I saw that I didn't check for further issues with the format. > > > > > > > > > > > > > > > > 2. I will assume your register call sets RMPP. > > > > > > > > > > > > > > > > 3. SA class version is 2. > > > > > > > > > > > > > > > > What SM are you using ? If you are using OpenSM, you can turn on verbose and see if the packet is seen by the SM. You could also enable madeye (in utils) to see if the packet is sent (and if anything is received back). > > > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > ________________________________ > > > > > > > > > > > > > > > > From: openib-general-bounces at openib.org on behalf of Takshak C. > > > > > > > > Sent: Mon 2/6/2006 8:00 AM > > > > > > > > To: openib-general at openib.org > > > > > > > > Subject: [openib-general] Get Table Records for SA Attribute ID ? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I m trying to get the table records for SA attribute ID in following way. > > > > > > > > But, I m not getting a single record, could anyone comment on the problem. > > > > > > > > > > > > > > > > 1. 
I have created saMadFormat structure described in the specification as below: > > > > > > > > > > > > > > > > struct saMadFormat > > > > > > > > { > > > > > > > > > > > > > > > > uint8_t base_version ; > > > > > > > > uint8_t mgmt_class ; > > > > > > > > uint8_t class_version ; > > > > > > > > uint8_t sa_method ; > > > > > > > > uint16_t status ; > > > > > > > > uint16_t not_used ; > > > > > > > > uint64_t tid ; > > > > > > > > uint16_t attr_id ; > > > > > > > > uint16_t resv ; > > > > > > > > uint32_t attr_mod ; > > > > > > > > uint64_t sa_key; > > > > > > > > uint64_t sm_key ; > > > > > > > > uint32_t seg_num ; > > > > > > > > uint32_t payload_len ; > > > > > > > > uint8_t frag_flag ; > > > > > > > > uint8_t edit_mod ; > > > > > > > > uint16_t window ; > > > > > > > > uint32_t endRID ; > > > > > > > > uint64_t comp_mask ; > > > > > > > > uint8_t adminData[192] ; > > > > > > > > }; > > > > > > > > > > > > > > > > 2. Then I have done all the basic operations like umad_open, umad_register for the IB_SA_CLASS > > > > > > > > and umad_open_port etc successfully. > > > > > > > > > > > > > > > > 3. struct saMadFormat *saQuery = (struct saMadFormat*)(umad_get_mad(umad)); > > > > > > > > memset(saQuery, 0, sizeof(*saQuery)); > > > > > > > > > > > > > > > > saQuery->base_version = 1; > > > > > > > > saQuery->mgmt_class = IB_SA_CLASS ; > > > > > > > > saQuery->class_version = 1 ; > > > > > > > > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; > > > > > > > > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; > > > > > > > > saQuery->attr_mod = 0 ; > > > > > > > > saQuery->tid = htonll(drmad_tid++); > > > > > > > > saQuery->endRID = 0 ; > > > > > > > > > > > > > > > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); > > > > > > > > umad_set_grh(umad, 0); > > > > > > > > umad_set_pkey(umad, 0xFFFF); > > > > > > > > > > > > > > > > 4. 
length = IB_MAD_SIZE; > > > > > > > > > > > > > > > > if (umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) > > > > > > > > IBPANIC("send failed"); > > > > > > > > > > > > > > > > if (umad_recv(portid, umad, &length, -1) != mad_agent) > > > > > > > > IBPANIC("recv error: %s", drmad_status_str(saQuery)); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > if (!dump_char) { > > > > > > > > xdump(stdout, 0, saQuery->adminData, 192); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > > > > > > > > > I m expecting that, I will get the resultant data in saQuery->adminData. > > > > > > > > Is this correct ? If not then, how should I retrieve the table records ? > > > > > > > > Any Idea ? > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > - Takshak
From mst at mellanox.co.il Fri Feb 24 07:20:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 24 Feb 2006 17:20:18 +0200 Subject: [openib-general] Re: OpenIb 1.0 release components In-Reply-To: <6a122cc00602231007y508c6c1flac8e451de11344a0@mail.gmail.com> References: <20060223170325.GB19426@mellanox.co.il> <1140714888.17258.60.camel@serpentine.pathscale.com> <6a122cc00602231007y508c6c1flac8e451de11344a0@mail.gmail.com> Message-ID: <20060224152018.GA16807@mellanox.co.il> Quoting Moni Levy : > > > While this might be a good idea for modules such as iSER > > > which are not currently part of the mainline kernel tree, > > > it is in my opinion clearly not a good idea to replace the > > > modules which *are* distributed with the mainline kernel. > > > > I agree, for the most part. > > > > What I have in mind for non-upstream kernel support is this: > > > > * We have to ship out-of-tree drivers, simply because there's only > > one driver in the upstream kernel, and the others are not yet > > ready for submission. > > * Some kernel components are clearly not contenders for shipping. > > One example is kdapl, because it appears to be dead due to > > upstream veto. > > * Others might be reasonable, if they (a) see some testing and (b) > > don't intrusively patch the core kernel. I'm thinking here > > about iSER and, to a lesser extent, SDP. > > I would like to add another point also. It looks like that in this > round of the major distribution releases they will just not be able to > include the 1.0 release due to time constraints, so the only way to > use 1.0 release (or newer) will be to replace them in the kernel. > > Moni I dont really understand this last point. What do you mean when you say "replace them in kernel"? Replace what? I understand it why you might want to add out of kernel modules such as iSER.
My point is they must work with core components included in kernel, not with core out of the svn tree. I gather Brian here agrees. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From tom at opengridcomputing.com Fri Feb 24 07:38:33 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 24 Feb 2006 09:38:33 -0600 Subject: [openib-general] [PATCH] Small change to addr.c for iWARP support Message-ID: <1140795513.29942.9.camel@trinity.ogc.int> Two changes to addr.c for iWARP: - Recognition of RDMA_NODE_RNIC in copy_addr. - Remove filter for ARPHRD_INFINIBAND in addr_arp_recv Signed-off-by: Tom Tucker Index: addr.c =================================================================== --- addr.c (revision 5460) +++ addr.c (working copy) @@ -73,6 +73,9 @@ case ARPHRD_INFINIBAND: dev_addr->dev_type = RDMA_NODE_IB_CA; break; + case ARPHRD_ETHER: + dev_addr->dev_type = RDMA_NODE_RNIC; + break; default: return -EADDRNOTAVAIL; } @@ -335,8 +338,7 @@ arp_hdr = (struct arphdr *) skb->nh.raw; - if (dev->type == ARPHRD_INFINIBAND && - (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || + if ((arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) || arp_hdr->ar_op == __constant_htons(ARPOP_REPLY))) set_timeout(jiffies); From halr at voltaire.com Fri Feb 24 07:49:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 10:49:26 -0500 Subject: [openib-general] user_mad questions In-Reply-To: <43FE403A.1020209@ichips.intel.com> References: <43FE403A.1020209@ichips.intel.com> Message-ID: <1140796153.28051.41767.camel@hal.voltaire.com> Hi Sean, On Thu, 2006-02-23 at 18:07, Sean Hefty wrote: > Roland Dreier wrote: > > Sean> Second, when a failed send is reported, there's a mismatch > > Sean> between the size of the data copied to the user and the size > > Sean> returned from ib_umad_read(). The copied data is > > Sean> sizeof(ib_user_mad) + sizeof(ib_mad). The reported size > > Sean> turns out to be sizeof(ib_user_mad) + sizeof(ib_mad_hdr). 
> > Sean> It appears that the data copy is off. > > > > Yes, that seems to be a long-standing bug. Not sure when that was introduced. > > I have a patch to fix this as part of some general RMPP cleanup/optimizations. Should the fix for this be separate from other RMPP changes ? -- Hal From tom at opengridcomputing.com Fri Feb 24 07:58:24 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 24 Feb 2006 09:58:24 -0600 Subject: [openib-general] [PATCH v2] iw_cm.h file with formatting changes Message-ID: <1140796704.29942.22.camel@trinity.ogc.int> Here is another version of the iw_cm.h file updated with the suggested formatting changes. Thanks for the comments. I have not changed the size of the pdata len yet. WRT casting the iWARP dev_addr to an ib_gid, the purpose for doing this was to simplify the cma_acquire_dev logic as follows.... static int cma_acquire_dev(struct rdma_id_private *id_priv) { enum rdma_node_type dev_type = id_priv->id.route.addr.dev_addr.dev_type; struct cma_device *cma_dev; union ib_gid *gid; int ret = -ENODEV; switch (rdma_node_get_transport(dev_type)) { case RDMA_TRANSPORT_IB: gid = ib_addr_get_sgid(&id_priv->id.route.addr.dev_addr); break; case RDMA_TRANSPORT_IWARP: gid = iw_addr_get_sgid(&id_priv->id.route.addr.dev_addr); break; default: return -ENODEV; } mutex_lock(&lock); list_for_each_entry(cma_dev, &dev_list, list) { ret = ib_find_cached_gid(cma_dev->device, gid, &id_priv->id.port_num, NULL); if (!ret) { cma_attach_to_dev(id_priv, cma_dev); break; } } mutex_unlock(&lock); return ret; } Since iWARP devices advertise their Ethernet addresses as port GIDs, this all works. Although the 'name' may be (ok, it is) misleading, the result simplifies the logic. Signed-off-by: Tom Tucker Index: include/rdma/iw_cm.h =================================================================== --- include/rdma/iw_cm.h (revision 0) +++ include/rdma/iw_cm.h (revision 0) @@ -0,0 +1,153 @@ +/* + * Copyright (c) 2005 Network Appliance, Inc. 
All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#if !defined(IW_CM_H) +#define IW_CM_H + +#include +#include + +struct iw_cm_id; +struct iw_cm_event; + +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, + IW_CM_EVENT_LLP_DISCONNECT, + IW_CM_EVENT_LLP_RESET, + IW_CM_EVENT_LLP_TIMEOUT, + IW_CM_EVENT_CLOSE +}; + +struct iw_cm_event { + enum iw_cm_event_type event; + int status; + u32 provider_id; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; +}; + +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); + +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_ESTABLISHED, /* established */ +}; + +typedef void (*iw_event_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* context to provide to client cb */ + enum iw_cm_state state; + struct ib_device *device; + struct ib_qp *qp; /* If the qp is null, use qp_num */ + u32 qp_num; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + u64 provider_id; /* device handle for this conn. */ + iw_event_handler event_handler; /* callback for IW CM Provider events */ +}; + +/** + * iw_create_cm_id - Allocate a communication identifier. + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the communication + * identifier. 
+ * + * Communication identifiers are used to track connection states, + * addr resolution requests, and listen requests. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); + +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ + +struct iw_cm_verbs { + int (*connect)(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); + + int (*disconnect)(struct iw_cm_id *cm_id, + int abrupt); + + int (*accept)(struct iw_cm_id *cm_id, + const void *private_data, + u8 pdata_data_len); + + int (*reject)(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); + + int (*getpeername)(struct iw_cm_id *cm_id, + struct sockaddr_in *local_addr, + struct sockaddr_in *remote_addr); + + int (*create_listen)(struct iw_cm_id *cm_id, + int backlog); + + int (*destroy_listen)(struct iw_cm_id *cm_id); + +}; + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); +void iw_destroy_cm_id(struct iw_cm_id *cm_id); +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in *local_add, + struct sockaddr_in *remote_addr); +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_connect(struct iw_cm_id *cm_id, + const void *pdata, u8 pdata_len); +int iw_cm_disconnect(struct iw_cm_id *cm_id); +int iw_cm_bind_qp(struct iw_cm_id *cm_id, struct ib_qp *qp); + +#endif /* IW_CM_H */ From halr at voltaire.com Fri Feb 24 08:22:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 11:22:06 -0500 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <200602231814.26918.jackm@mellanox.co.il> References: 
<200602231814.26918.jackm@mellanox.co.il> Message-ID: <1140798125.4336.1.camel@hal.voltaire.com> On Thu, 2006-02-23 at 11:14, Jack Morgenstein wrote: > On Thursday 23 February 2006 09:41, Sean Hefty wrote: > > > > What specific error do you see in the receive path? > > SA Host, Host1, Host2. By SA host, do you mean SA ? (SA can run on any endport (host, switch port 0 or router port). > Host1 and Host2 have simultaneous GET_TABLE query responses (both with same > TID) in flight with the SA Host. > > Host1 sends an RMPP abort to the SA. The SA Host receives the abort and does > abort_send(), searching on the TID alone. The wrong session gets aborted. ^^^^ could get aborted and that is because the lookup is on TID only (which could be the same from different SA clients on different hosts) ? -- Hal > - Jack > > > > > - Sean From kschoche at scl.ameslab.gov Fri Feb 24 08:31:06 2006 From: kschoche at scl.ameslab.gov (Kyle Schochenmaier) Date: Fri, 24 Feb 2006 10:31:06 -0600 Subject: [openib-general] create QP failure In-Reply-To: <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> References: <43FC9F0D.9040304@scl.ameslab.gov> <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> Message-ID: <43FF34CA.5000709@scl.ameslab.gov> Fabian Tillier wrote: >On 2/22/06, Roland Dreier wrote: > > >> Kyle> The second ibv_create_qp() call fails for some reason, even >> Kyle> though it's being initialized with the same attributes as >> Kyle> the first call/qp. I didnt see anything in particular that >> Kyle> said this should break on updating to the latest rc's. Have >> Kyle> I missed a major change which would cause this to break now? >> >>It should still work. There was a change to the interface, so that >>the kernel now returns the real QP capacities so libibverbs can give >>you the true QP attributes that your QP was created with. This >>probably introduced a bug somewhere. 
>> >> > >Roland, is the attribute structure now being used as an output >parameter when it wasn't before? If so, what happens if the output >from one create_qp call gets passed as input into another? > >And Kyle, did the application reset the attributes before the second >call, or just pass the output of the first call as input to the >second? > >- Fab > > > > > In my case, where I was pretty much literally calling: ibv_create_qp(c->qp,&attr); check_for_errors(); ibv_create_qp(c->qp_ack,&attr); check_for_errors(); The second create_qp call failed over. So maybe the answer to your question about parameter passing can be discovered from that ;) Apparently it is (now?) necessary to use a seperate - or 'reset' the old one - ibv_qp_init_attr struct to pass to subsequent ibv_create_qp() calls. Previously I had been unaware of this, and was able to get by with using the same init_attr's passed to both create_qp(*) calls. I guess I just needed to zero out some of the values that got set inside the create_qp() call, it makes sense that I would need to do this.. thanks, - Kyle -- Kyle Schochenmaier kschoche at scl.ameslab.gov Research Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory From halr at voltaire.com Fri Feb 24 08:28:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 11:28:42 -0500 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <43FDF686.1010504@ichips.intel.com> References: <43FDF686.1010504@ichips.intel.com> Message-ID: <1140798219.4336.4.camel@hal.voltaire.com> On Thu, 2006-02-23 at 12:53, Sean Hefty wrote: > Sean Hefty wrote: > > I still need to consider this in more detail to see if there isn't some simpler > > solution that we're overlooking. > > I was thinking about this more, and I'd like to make sure that we identify all > separate issues. > > Ignoring RMPP completely, I'm not convinced that we need to take any action. 
A > request will either generate a response or not. If no response is generated, > checking that a send is a duplicate provides very little protection. There's > only a small window on the send side that such a check would even work, and the > receiving side still needs to handle this. And if a response were generated, I > don't see that there's any real issue. > > I'd like to get some agreement on whether we really need to take any action for > non-RMPP MADs, then consider what issues RMPP adds. Sorry for not following this thread completely. What's the TID issue (separate from the RMPP issues on top of it) ? Is it the potential duplication of the TID from different hosts adding complexity on the SM/SA side ? Is that a fair assesment ? -- Hal > - Sean
From rdreier at cisco.com Fri Feb 24 09:08:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 09:08:20 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FF34CA.5000709@scl.ameslab.gov> (Kyle Schochenmaier's message of "Fri, 24 Feb 2006 10:31:06 -0600") References:
<43FC9F0D.9040304@scl.ameslab.gov> <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> <43FF34CA.5000709@scl.ameslab.gov> Message-ID: > In my case, where I was pretty much literally calling: > ibv_create_qp(c->qp,&attr); > check_for_errors(); > ibv_create_qp(c->qp_ack,&attr); > check_for_errors(); > The second create_qp call failed over. > So maybe the answer to your question about parameter passing can > be discovered from that ;) > Apparently it is (now?) necessary to use a seperate - or 'reset' > the old one - ibv_qp_init_attr struct to pass to subsequent > ibv_create_qp() calls. No, this is a bug in the verbs library or the kernel driver. It should work fine to reuse the same structure, and it should not be necessary to clear anything in the attributes. Can you give more details about the failure you're seeing? For example, could you dump the QP attributes structure before and after the first ibv_create_qp() call? Or you could just post a test case that shows the problem. - R. From bos at pathscale.com Fri Feb 24 09:16:14 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 09:16:14 -0800 Subject: [openib-general] Suggested components to support in 1.0 Message-ID: <1140801374.1158.22.camel@camp4.serpentine.com> Here is a first cut at the set of components (protocols, drivers, userspace bits) that I think we should be supporting in 1.0. Please look over it and let me know if I'm missing anything. HCA support (both kernel driver and userspace verbs components): * ehca * ipath * mthca IB protocols: * IPoIB * RC * SDP * SRQ * UC * UD Userland software: * libibverbs * libsdp * opensm As far as I can tell, most of the rest of OpenIB userland (libibcm, libibat, libibmad, etc) is logically part of OpenSM, can be treated as such (I think Doug is already doing this with his Red Hat spec files) and is unlikely to be used by other applications. Am I way off? 
Components that I don't know what to do about, and will likely want to drop unless someone can vouch for them: * iSER * SRP * uDAPL From swise at opengridcomputing.com Fri Feb 24 09:22:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 24 Feb 2006 11:22:26 -0600 Subject: [openib-general] fmr question In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F12973BD@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F12973BD@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1140801747.11764.24.camel@stevo-desktop> Something similar to the mw_bind semantics should work to make it more like the iwarp fast-register (i'm not sure about IB 1.2). A function like ib_bind_mw() to post the map WR, and then a new completion type to post the results back to the CQ... Would we just change ib_map_phys_fmr() to do this or create a new API function to preserve backwards compatibility? On Thu, 2006-02-23 at 15:13 -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Steve> So this implies that there is really only one mapping > > Steve> outstanding at any point in time (less the cache issue). > > Steve> Right? So why is there a map count as an fmr attribute? > > Steve> Its seems like just an arbitrary limit put on how many > > Steve> times you can map an fmr before unmaping. -and- once you > > Steve> unmap you can start mapping again up to the map count... ?? > > > > Part of the L_Key and R_Key are twiddled each time the FMR is > > remapped. Because of the possibility of stale data hanging > > around the cache, we don't want to reuse an old key without > > flushing any translation caches. So when we've used all the > > possibilities for the changeable part of the key, we need to > > make sure the FMR is unmapped (and the cache flushed) before > > remapping it. > > > > The whole FMR scheme was driven by Mellanox hardware, so if > > there's a way we can change things to make it more generic, > > I'm all for it. 
> > > > Both the RDMAC and InfiniBand 1.2 verbs define FMR binding > and invalidation through work requests on privileged QPs. > Invalidation of a specific bind is reported as a Work Completion. > By definition, any translation caches MUST be cleared by the > time that completion is delivered to the Consumer. > > Adopting those semantics, and making them available via a > Work Request for devices supporting that capability, would > eliminate the problem. > > In my opinion undefined flushing of old translation caches > is a security vulnerability that makes the current FMR definition > unusable in any network that was not 100% physically secured. > And even then it is a weakness asking for a bug to hit it. From robert.j.woodruff at intel.com Fri Feb 24 09:23:26 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 09:23:26 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> Bryan wrote, >Components that I don't know what to do about, and will likely want to >drop unless someone can vouch for them: > * iSER > * SRP > * uDAPL We need uDAPL and I am sure people want SRP and I think both are in good shape. I am not sure that iSer is quite ready, but will let Voltaire make that call. woody From caitlinb at broadcom.com Fri Feb 24 09:32:31 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 24 Feb 2006 09:32:31 -0800 Subject: [openib-general] fmr question Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297477@NT-SJCA-0751.brcm.ad.broadcom.com> Steve Wise wrote: > Something similar to the mw_bind semantics should work to > make it more like the iwarp fast-register (i'm not sure about > IB 1.2). A function like ib_bind_mw() to post the map WR, > and then a new completion type to post the results back to the CQ... 
> > Would we just change ib_map_phys_fmr() to do this or create a > new API function to preserve backwards compatibility? > > The rationale for bind_mw() is that the caller needs to know what the RKey that will be associated with the window when the bind completes *before* it reaps the completion. A special method is not required to achieve this when the RKey is defined to have an index and key portion, and the new key value is selected by the consumer. All that is required is to post the work request. For FMRs that applies to RDMAC, IB 1.2 and RNIC-PI. Therefore new work request and work completion structs are all that are required. There is no need for a bind_fmr() method, anymore than there is for a post_rdma_write() method. From halr at voltaire.com Fri Feb 24 09:31:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 12:31:15 -0500 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140801374.1158.22.camel@camp4.serpentine.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <1140802077.4336.34.camel@hal.voltaire.com> Hi Bryan, On Fri, 2006-02-24 at 12:16, Bryan O'Sullivan wrote: > Here is a first cut at the set of components (protocols, drivers, > userspace bits) that I think we should be supporting in 1.0. Please > look over it and let me know if I'm missing anything. > > HCA support (both kernel driver and userspace verbs components): > > * ehca > * ipath > * mthca > > IB protocols: > > * IPoIB > * RC > * SDP > * SRQ > * UC > * UD I don't understand RC, UC, UD being separately listed here. > Userland software: > > * libibverbs > * libsdp > * opensm There are diags as well as OpenSM in the management directory. Also, what about librdmacm ? libibcm ? > As far as I can tell, most of the rest of OpenIB userland (libibcm, > libibat, libibmad, etc) is logically part of OpenSM, libibmad (and libibcommon and libibumad) are but libibcm and libibat are not. 
I believe libibat as well as the core at and user_at are superceeded by addr and libibrdma. > can be treated as > such (I think Doug is already doing this with his Red Hat spec files) > and is unlikely to be used by other applications. Am I way off? > > Components that I don't know what to do about, and will likely want to > drop unless someone can vouch for them: > > * iSER > * SRP > * uDAPL Why not ? -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From robert.j.woodruff at intel.com Fri Feb 24 09:34:43 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 09:34:43 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <000101c63968$992ec900$6aa1070a@amr.corp.intel.com> Bryan wrote, >Here is a first cut at the set of components (protocols, drivers, >userspace bits) that I think we should be supporting in 1.0. Please >look over it and let me know if I'm missing anything. If we plan on making the branch today, does everyone have the code checked in that they want in the release ? Right now the SVN tree is at 5491, can we assume that is the rev. for the Branch ? woody From robert.j.woodruff at intel.com Fri Feb 24 09:41:09 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 09:41:09 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140802077.4336.34.camel@hal.voltaire.com> Message-ID: <000201c63969$7f1eebc0$6aa1070a@amr.corp.intel.com> Hal wrote, >Also, what about librdmacm ? libibcm ? 
> As far as I can tell, most of the rest of OpenIB userland (libibcm, > libibat, libibmad, etc) is logically part of OpenSM, Perhaps another way to determine what should be in the release is to take everything that is currently in the trunk (say svn rev 5491), and then discuss removing stuff that is no longer needed or not yet ready. woody From bos at pathscale.com Fri Feb 24 09:42:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 09:42:51 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> Message-ID: <1140802971.1158.38.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 09:23 -0800, Bob Woodruff wrote: > >Components that I don't know what to do about, and will likely want to > >drop unless someone can vouch for them: > > > * iSER > > * SRP > > * uDAPL > > > We need uDAPL and I am sure people want SRP and > I think both are in good shape. OK. Rather than dig through my mail logs, can someone at NetApp please help me out and let us know what the official state of udapl is? The fact that there are 3 userspace providers, and that kdapl still hasn't been killed off in the kernel source tree, makes the situation very confusing to people who are not following closely. Regarding SRP, if you think it's in OK shape, that's OK by me. > I am not sure that iSer is quite ready, but will let Voltaire make that > call. OK. References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <43FF4616.3010502@ichips.intel.com> Bryan O'Sullivan wrote: > Userland software: > > * libibverbs > * libsdp > * opensm > > As far as I can tell, most of the rest of OpenIB userland (libibcm, > libibat, libibmad, etc) is logically part of OpenSM, can be treated as > such (I think Doug is already doing this with his Red Hat spec files) > and is unlikely to be used by other applications. Am I way off? 
libibcm and librdmacm are not OpenSM related, but at least one is required to support userspace connections. We should support librdmacm. libibcm is not as important. > Components that I don't know what to do about, and will likely want to > drop unless someone can vouch for them: > > * iSER > * SRP > * uDAPL uDAPL is important to Intel. - Sean From bos at pathscale.com Fri Feb 24 09:47:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 09:47:18 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140802077.4336.34.camel@hal.voltaire.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> Message-ID: <1140803238.1158.42.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 12:31 -0500, Hal Rosenstock wrote: > I don't understand RC, UC, UD being separately listed here. Just completeness. > > Userland software: > > > > * libibverbs > > * libsdp > > * opensm > > There are diags as well as OpenSM in the management directory. Do you want the diags in, or out? They're not packaged in any way, so I'd vote for "out". > Also, what about librdmacm ? libibcm ? > > > As far as I can tell, most of the rest of OpenIB userland (libibcm, > > libibat, libibmad, etc) is logically part of OpenSM, > > libibmad (and libibcommon and libibumad) are but libibcm and libibat are > not. I believe libibat as well as the core at and user_at are > superceeded by addr and libibrdma. OK. What kinds of states are the old and new libraries in? I'd rather not ship a library and its proposed replacement. Message-ID: <000301c6396a$9c14e710$6aa1070a@amr.corp.intel.com> Bryan wrote, >Regarding SRP, if you think it's in OK shape, that's OK by me. I have not tested SRP, so I do not know, but I think others have. 
From mshefty at ichips.intel.com Fri Feb 24 09:52:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Feb 2006 09:52:33 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <1140798219.4336.4.camel@hal.voltaire.com> References: <43FDF686.1010504@ichips.intel.com> <1140798219.4336.4.camel@hal.voltaire.com> Message-ID: <43FF47E1.4090406@ichips.intel.com> Hal Rosenstock wrote: > What's the TID issue (separate from the RMPP issues on top of it) ? Is > it the potential duplication of the TID from different hosts adding > complexity on the SM/SA side ? Is that a fair assesment ? If we ignore RMPP, the TID issue as far as I can make out is this. Duplicate requests may be received by a ULP, resulting in duplicate responses. Looking at the code, I think that the only real issue here is some inefficiencies. (Plus a bug that can result in matching RMPP ACKs with the wrong transaction, but this is a separate issue.) Adding in RMPP, the inefficiencies grow. For example, host A issues a path record query, times out, and resends the query. The SA on the other side currently sends two responses, but both responses will have the same TID. When segments are received by host A, he will interpret all segments as belonging to one response, with the second response containing duplicated MADs. What I believe will happen is that one of the responses will be reassembled and returned to the user. On the SA side, all ACKs will match with the first response, and the second response will time out, never getting past the first segment. I was thinking about how we can reduce some of the inefficiencies. Currently, we only track if a request is waiting for a response or not. We can add a new state indicating that a response is in progress, which would be set when the first segment of a response is received. This would be used to suppress duplicate requests. 
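The "response in progress" state described above can be sketched as a small stand-alone C fragment. This is only an illustration under stated assumptions: every name in it (req_table, req_should_send, req_response_started, req_done) is hypothetical and none of it is ib_mad code. It tracks one state per TID and refuses to resend a request once the first segment of its response has been seen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request states; the last one is the proposed addition. */
enum req_state {
	REQ_FREE = 0,
	REQ_WAIT_RESPONSE,		/* request sent, nothing received yet */
	REQ_RESPONSE_IN_PROGRESS	/* first response segment has arrived */
};

struct req_entry {
	uint64_t tid;
	enum req_state state;
};

#define REQ_TABLE_SIZE 8	/* arbitrary size for the sketch */

struct req_table {
	struct req_entry entry[REQ_TABLE_SIZE];
};

static struct req_entry *req_find(struct req_table *t, uint64_t tid)
{
	size_t i;

	for (i = 0; i < REQ_TABLE_SIZE; i++)
		if (t->entry[i].state != REQ_FREE && t->entry[i].tid == tid)
			return &t->entry[i];
	return NULL;
}

/* Returns true if a send with this TID should go out on the wire. */
static bool req_should_send(struct req_table *t, uint64_t tid)
{
	struct req_entry *r = req_find(t, tid);
	size_t i;

	/* A retry is only useful while no response has started arriving. */
	if (r)
		return r->state == REQ_WAIT_RESPONSE;

	/* New request: claim a free slot. */
	for (i = 0; i < REQ_TABLE_SIZE; i++) {
		if (t->entry[i].state == REQ_FREE) {
			t->entry[i].tid = tid;
			t->entry[i].state = REQ_WAIT_RESPONSE;
			return true;
		}
	}
	return false;	/* table full; refusing is an arbitrary choice here */
}

/* Called when the first segment of a response arrives. */
static void req_response_started(struct req_table *t, uint64_t tid)
{
	struct req_entry *r = req_find(t, tid);

	if (r)
		r->state = REQ_RESPONSE_IN_PROGRESS;
}

/* Called when the response is complete or the request has timed out. */
static void req_done(struct req_table *t, uint64_t tid)
{
	struct req_entry *r = req_find(t, tid);

	if (r)
		r->state = REQ_FREE;
}
```

In this sketch a resend is still allowed while the request is merely waiting, which matches the case where the original request was lost; only a resend that races with an arriving response is suppressed.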
On the receive side, I was considering adding an API that the user would invoke to indicate that a response was being generated. The MAD layer would queue this information, and a received request would be checked against this queue to determine if it were a duplicate. When the response is sent, the queued information would be removed. I think that we may be able to use such an API to support dual-sided RMPP as well. - Sean From robert.j.woodruff at intel.com Fri Feb 24 09:52:31 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 09:52:31 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803238.1158.42.camel@camp4.serpentine.com> Message-ID: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> Bryan wrote, >Do you want the diags in, or out? They're not packaged in any way, so >I'd vote for "out". Again, I would vote for including everything that is in the trunk unless the maintainer decides it is not ready or it is some code that is now obsolete and should be removed. woody From halr at voltaire.com Fri Feb 24 09:49:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 12:49:48 -0500 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <000101c63968$992ec900$6aa1070a@amr.corp.intel.com> References: <000101c63968$992ec900$6aa1070a@amr.corp.intel.com> Message-ID: <1140802641.4336.46.camel@hal.voltaire.com> On Fri, 2006-02-24 at 12:34, Bob Woodruff wrote: > Bryan wrote, > >Here is a first cut at the set of components (protocols, drivers, > >userspace bits) that I think we should be supporting in 1.0. Please > >look over it and let me know if I'm missing anything. > > If we plan on making the branch today, does everyone have > the code checked in that they want in the release ? > Right now the SVN tree is at 5491, can we assume that is the > rev. for the Branch ? 
If that's not the case, we'll just need to track any needed changes over from the trunk to the release branch. I think this needs to get started. -- Hal > woody > > From yaronh at voltaire.com Fri Feb 24 09:54:31 2006 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 24 Feb 2006 19:54:31 +0200 Subject: [openib-general] Suggested components to support in 1.0 Message-ID: <35EA21F54A45CB47B879F21A91F4862FACFDCB@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Bob Woodruff > Sent: Friday, February 24, 2006 12:23 PM > To: 'Bryan O'Sullivan'; openib-general > Subject: RE: [openib-general] Suggested components to support in 1.0 > > Bryan wrote, > >Components that I don't know what to do about, and will likely want to > >drop unless someone can vouch for them: > > > * iSER > > * SRP > > * uDAPL > > > We need uDAPL and I am sure people want SRP and > I think both are in good shape. > I am not sure that iSer is quite ready, but will let Voltaire make that > call. > > woody Woody, I believe that OpenIB iSER is quickly getting there with the amount of dedicated work Or, Dan, and others put into it We would definitely vote for it Yaron > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From jlentini at netapp.com Fri Feb 24 09:55:43 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 24 Feb 2006 12:55:43 -0500 (EST) Subject: [openib-general] [PATCH][RFC] CMA automatic port number assignment Message-ID: The RDMA CM does not automatically assign a port number to the active consumer on IB. 
My expectation was that if the active consumer called

    rdma_create_id()
    rdma_resolve_addr()
    rdma_create_qp()
    rdma_resolve_route()
    rdma_connect()

analogous to the BSD sockets sequence

    socket(2)
    connect(2)

the consumer would automatically be assigned a local TCP/UDP/SCTP port number.

The simple patch below solves the problem by picking a random ephemeral port number (where ephemeral is defined as any port > 1024).

This fix is not perfect. The choice of port number does not coordinate with the relevant transport stack (TCP/UDP/SDP). Given that the passive side CMA code has the same behavior, this follows an established precedent.

Also, this method needs to be reviewed for security issues. There is no attempt to ensure that multiple (src addr, dest addr, dest port) triplets do not simultaneously use the same src port or to ensure that the same port is not reused before some given timeout (e.g. TIME_WAIT). Given that the src port is chosen at random, I would expect both of these occurrences to be extremely unlikely.

As a reference, the actual TCP code makes this assignment in tcp_v4_hash_connect() using the sysctl_local_port_range array to determine the range of ephemeral ports and the secure_tcp_port_ephemeral() function to choose a secure point to start searching for an available port. Unfortunately neither of these are available to module code.
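The port selection used by the patch that follows can be tried out in user space. A minimal sketch, assuming a caller-supplied 16-bit random value in place of get_random_bytes() and htons() in place of cpu_to_be16(); the function names pick_ephemeral_port_host() and pick_ephemeral_port_net() are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <arpa/inet.h>

/*
 * Hypothetical user-space analogue of the patch's port selection.
 * ORing with 1024 sets bit 10, so the result is always >= 1024.
 */
static uint16_t pick_ephemeral_port_host(uint16_t random_bits)
{
	return (uint16_t)(random_bits | 1024U);
}

/* Same value in network byte order, as stored in sin_port. */
static uint16_t pick_ephemeral_port_net(uint16_t random_bits)
{
	return htons(pick_ephemeral_port_host(random_bits));
}
```

Two properties of the OR are worth keeping in mind when reviewing the approach: the result can be exactly 1024, and any port with bit 10 clear (for example 2048 through 3071) is never produced, so only half of the 16-bit values are reachable.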
Signed-off-by: James Lentini

Index: core/cma.c
===================================================================
--- core/cma.c (revision 5489)
+++ core/cma.c (working copy)
@@ -1361,6 +1361,14 @@ err:
 }
 EXPORT_SYMBOL(rdma_bind_addr);
 
+static u16 cma_generate_ephemeral_port(void)
+{
+	u16 port;
+
+	get_random_bytes(&port, sizeof port);
+	return cpu_to_be16(port | 1024U);
+}
+
 static void cma_format_hdr(void *hdr, enum rdma_port_space ps,
 			   struct rdma_route *route)
 {
@@ -1371,6 +1379,9 @@ static void cma_format_hdr(void *hdr, en
 	src4 = (struct sockaddr_in *) &route->addr.src_addr;
 	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
 
+	if (!src4->sin_port)
+		src4->sin_port = cma_generate_ephemeral_port();
+
 	switch (ps) {
 	case RDMA_PS_SDP:
 		sdp_hdr = hdr;

From bos at pathscale.com Fri Feb 24 09:57:32 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 24 Feb 2006 09:57:32 -0800
Subject: [openib-general] Suggested components to support in 1.0
In-Reply-To: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com>
References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com>
Message-ID: <1140803852.1158.46.camel@camp4.serpentine.com>

On Fri, 2006-02-24 at 09:52 -0800, Bob Woodruff wrote:
> Bryan wrote,
> >Do you want the diags in, or out? They're not packaged in any way, so
> >I'd vote for "out".
>
> Again, I would vote for including everything that is in the trunk
> unless the maintainer decides it is not ready or it is some
> code that is now obsolete and should be removed.

I don't have a problem with leaving code in the SVN branch and not touching it. My point is more that the management diags don't have an RPM spec file, so if someone doesn't write one, they won't get shipped in binary form, and hence they won't get tested or used. This applies to other components, too.
References: <35EA21F54A45CB47B879F21A91F4862FACFDCB@taurus.voltaire.com> Message-ID: <1140803884.1158.48.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 19:54 +0200, Yaron Haviv wrote: > Woody, I believe that OpenIB iSER is quickly getting there with the > amount of dedicated work Or, Dan, and others put into it > > We would definitely vote for it Thanks. That's good to know. References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> Message-ID: <1140803796.4336.48.camel@hal.voltaire.com> On Fri, 2006-02-24 at 12:47, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 12:31 -0500, Hal Rosenstock wrote: > > > I don't understand RC, UC, UD being separately listed here. > > Just completeness. > > > > Userland software: > > > > > > * libibverbs > > > * libsdp > > > * opensm > > > > There are diags as well as OpenSM in the management directory. > > Do you want the diags in, or out? They're not packaged in any way, so > I'd vote for "out". I think they need to be in. > > Also, what about librdmacm ? libibcm ? > > > > > As far as I can tell, most of the rest of OpenIB userland (libibcm, > > > libibat, libibmad, etc) is logically part of OpenSM, > > > > libibmad (and libibcommon and libibumad) are but libibcm and libibat are > > not. I believe libibat as well as the core at and user_at are > > superceeded by addr and libibrdma. > > OK. What kinds of states are the old and new libraries in? I'd rather > not ship a library and its proposed replacement. 
> > From mshefty at ichips.intel.com Fri Feb 24 10:03:20 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Feb 2006 10:03:20 -0800 Subject: [openib-general] user_mad questions In-Reply-To: <1140796153.28051.41767.camel@hal.voltaire.com> References: <43FE403A.1020209@ichips.intel.com> <1140796153.28051.41767.camel@hal.voltaire.com> Message-ID: <43FF4A68.5050503@ichips.intel.com> Hal Rosenstock wrote: >>> Sean> Second, when a failed send is reported, there's a mismatch >>> Sean> between the size of the data copied to the user and the size >>> Sean> returned from ib_umad_read(). The copied data is >>> Sean> sizeof(ib_user_mad) + sizeof(ib_mad). The reported size >>> Sean> turns out to be sizeof(ib_user_mad) + sizeof(ib_mad_hdr). >>> Sean> It appears that the data copy is off. >>> >>>Yes, that seems to be a long-standing bug. Not sure when that was introduced. >> >>I have a patch to fix this as part of some general RMPP cleanup/optimizations. > > Should the fix for this be separate from other RMPP changes ? It will take some time for me to separate out this change, but I can do that. I didn't discover this until near the end of the updates that I was trying to do, which were actually related to receive handling... - Sean From bos at pathscale.com Fri Feb 24 10:03:37 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 10:03:37 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803796.4336.48.camel@hal.voltaire.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> <1140803796.4336.48.camel@hal.voltaire.com> Message-ID: <1140804217.1158.50.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 12:56 -0500, Hal Rosenstock wrote: > I think they need to be in. OK. Will you write a spec file so that they get packaged appropriately, then, or do you want someone else to do that? 
References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> <1140803852.1158.46.camel@camp4.serpentine.com> Message-ID: <1140804038.4336.53.camel@hal.voltaire.com> On Fri, 2006-02-24 at 12:57, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 09:52 -0800, Bob Woodruff wrote: > > Bryan wrote, > > >Do you want the diags in, or out? They're not packaged in any way, so > > >I'd vote for "out". > > > > Again, I would vote for including everything that is in the trunk > > unless the maintainer decides it is not ready or it is some > > code that is now obsolete and should be removed. > > I don't have a problem with leaving code in the SVN branch and not > touching it. My point is more that the management diags don't have an > RPM spec file, so if someone doesn't write one, they won't get shipped > in binary form, and hence they won't get tested or used. I will add this for the diags. > This applies > to other components, too. > > From halr at voltaire.com Fri Feb 24 10:12:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 13:12:12 -0500 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140804217.1158.50.camel@camp4.serpentine.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> <1140803796.4336.48.camel@hal.voltaire.com> <1140804217.1158.50.camel@camp4.serpentine.com> Message-ID: <1140804691.4336.55.camel@hal.voltaire.com> On Fri, 2006-02-24 at 13:03, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 12:56 -0500, Hal Rosenstock wrote: > > > I think they need to be in. > > OK. Will you write a spec file so that they get packaged appropriately, > then, or do you want someone else to do that? I'll do it. I'll get to it as soon as I can. Don't delay cutting the branch for this. I'm sure there are a number of issues we will need to work out. This is all part of getting to the release IMO. 
-- Hal
> >

From kschoche at scl.ameslab.gov Fri Feb 24 10:16:14 2006
From: kschoche at scl.ameslab.gov (Kyle Schochenmaier)
Date: Fri, 24 Feb 2006 12:16:14 -0600
Subject: [openib-general] create QP failure
In-Reply-To:
References: <43FC9F0D.9040304@scl.ameslab.gov> <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> <43FF34CA.5000709@scl.ameslab.gov>
Message-ID: <43FF4D6E.1020504@scl.ameslab.gov>

Roland Dreier wrote:

> > In my case, where I was pretty much literally calling:
> > ibv_create_qp(c->qp,&attr);
> > check_for_errors();
> > ibv_create_qp(c->qp_ack,&attr);
> > check_for_errors();
> >
> > The second create_qp call failed over.
> > So maybe the answer to your question about parameter passing can
> > be discovered from that ;)
> >
> > Apparently it is (now?) necessary to use a seperate - or 'reset'
> > the old one - ibv_qp_init_attr struct to pass to subsequent
> > ibv_create_qp() calls.
>
>No, this is a bug in the verbs library or the kernel driver. It
>should work fine to reuse the same structure, and it should not be
>necessary to clear anything in the attributes. Can you give more
>details about the failure you're seeing? For example, could you dump
>the QP attributes structure before and after the first ibv_create_qp()
>call? Or you could just post a test case that shows the problem.
>
> - R.
>

It appears as if att.cap.max_inline_data is getting set by create_qp()

(**ibv_qp_init_attr att**)
qp_context:(nil) send_cq:0x597020 recv_cq:0x597020 srq:(nil) s_wr:65535 r_wr:65535 s_sge:28 r_sge:28 inline:0

c->qp = ibv_create_qp(c->pd, &att);

(**ibv_qp_init_attr att**)
qp_context:(nil) send_cq:0x597020 recv_cq:0x597020 srq:(nil) s_wr:65535 r_wr:65535 s_sge:28 r_sge:28 inline:476

..which causes the second create_qp() to fail.

I'm not entirely sure how to get any other useful return codes from this type of function call.. any ideas?

hope that helps,
- Kyle

--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr.
Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory From bos at pathscale.com Fri Feb 24 10:17:18 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 10:17:18 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140804691.4336.55.camel@hal.voltaire.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> <1140803796.4336.48.camel@hal.voltaire.com> <1140804217.1158.50.camel@camp4.serpentine.com> <1140804691.4336.55.camel@hal.voltaire.com> Message-ID: <1140805038.1158.61.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 13:12 -0500, Hal Rosenstock wrote: > I'll do it. I'll get to it as soon as I can. Don't delay cutting the > branch for this. I'm sure there are a number of issues we will need to > work out. This is all part of getting to the release IMO. Yes, I agree. I'm creating a 1.0 branch in the SVN repository today. It will be named gen2/branches/1.0. I'll send out mail when it's ready (in about an hour). Alexander Nezhinsky wrote: > Caitlin Bestler wrote: > >>> So what do the iser sockets do? They look like noop stubs to me. >>> >> >> Good question. >> >> I am guessing that they are exactly noop stubs, and the real point is >> to have a socket associated with an iSER RDMA connection. >> >> My question is why? Unless the attempt is to allow upgrading an iSCSI >> stream connection to an iSER connection I don't see why an iSER RDMA >> connection is in any more of a need for having a proxy socket than >> any other RDMA connection. >> > We don't upgrade iSCSI stream connection but start with an > RDMA connection right away. > The iSER code is going to be one of open-iscsi transports, > and open-iscsi opens connections using sockets from user > space, which is only natural with tcp. 
> The iSER RC connection should be open from kernel, so this > special socket gives us an opportunity to do so, while > leaving intact the entire mechanism of connection > establishment and user-kernel handover. > We don't really need to implement read/write primitives > because they are initiated either from within kernel > transport module itself or through a special user-kernel > interface bypassing the socket. > >> I really don't see the benefit of having a "socket" that is not truly >> integrated with the host stack. What socket attributes are being >> sought? And how is it unique to iSER as opposed to RDMA in general? >> > Perhaps it is plausible to implement a general-purpose > stack-integrated socket giving access to IB RC connections, > if this is what you mean. > But this was clearly out of scope for the iSER initiator. > We sought a solution to the immediate problem in open-iscsi. > So the main socket feature used here was a neat way to > delegate the IB connection establishment from user space to kernel. I remain concerned that the use of a socket without full socket semantics will only lead to confusion. I also do not see how it can enable iSER/IP. For example, is this limited socket inheritable? If so how does it transfer ownership of the "connection" to the child? How will this work for iSER/IP as well as iSER/IB? Extending CMA to allow sending/receiving startup-phase messages does map naturally to iSER/IB, iSER/IP as well as plain iSCSI/TCP. It also avoids creating any unrealistic expectations, such as a general purpose write that can composite a message in multiple calls or that can be inherited by a child, that calling it a "socket" would imply to any experienced socket developer. 
From jlentini at netapp.com Fri Feb 24 10:24:22 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 24 Feb 2006 13:24:22 -0500 (EST) Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140802971.1158.38.camel@camp4.serpentine.com> References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> <1140802971.1158.38.camel@camp4.serpentine.com> Message-ID: On Fri, 24 Feb 2006, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 09:23 -0800, Bob Woodruff wrote: > > > >Components that I don't know what to do about, and will likely want to > > >drop unless someone can vouch for them: > > > > > * iSER > > > * SRP > > > * uDAPL > > > > > > We need uDAPL and I am sure people want SRP and > > I think both are in good shape. > > OK. Rather than dig through my mail logs, can someone at NetApp please > help me out and let us know what the official state of udapl is? The > fact that there are 3 userspace providers, and that kdapl still hasn't > been killed off in the kernel source tree, makes the situation very > confusing to people who are not following closely. uDAPL should be included. The different providers were developed to support 3 different OpenIB configurations: OpenIB verbs (when there was no user space CM), OpenIB verbs + IB CM, and OpenIB verbs + RDMA CM. I recommend you include the openib_cma provider and the user space RDMA cm library. kDAPL does not need to be in the release, although it would be trivial to include it. Please let me know if you need any help with this. 
From halr at voltaire.com Fri Feb 24 10:24:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 13:24:14 -0500 Subject: [openib-general] user_mad questions In-Reply-To: <43FF4A68.5050503@ichips.intel.com> References: <43FE403A.1020209@ichips.intel.com> <1140796153.28051.41767.camel@hal.voltaire.com> <43FF4A68.5050503@ichips.intel.com> Message-ID: <1140805453.4336.59.camel@hal.voltaire.com> On Fri, 2006-02-24 at 13:03, Sean Hefty wrote: > > Should the fix for this be separate from other RMPP changes ? > > It will take some time for me to separate out this change, but I can do that. I > didn't discover this until near the end of the updates that I was trying to do, > which were actually related to receive handling... Then it might not be worth that extra effort. -- Hal From bos at pathscale.com Fri Feb 24 10:28:28 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 10:28:28 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> <1140802971.1158.38.camel@camp4.serpentine.com> Message-ID: <1140805708.1158.70.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 13:24 -0500, James Lentini wrote: > uDAPL should be included. The different providers were developed to > support 3 different OpenIB configurations: OpenIB verbs (when there > was no user space CM), OpenIB verbs + IB CM, and OpenIB verbs + RDMA > CM. I recommend you include the openib_cma provider and the user space > RDMA cm library. OK, thanks. Someone will need to write RPM spec files for these components. Is this something someone at NetApp can do? > kDAPL does not need to be in the release, although it would be trivial > to include it. My understanding is that kDAPL has been vetoed for inclusion in the upstream kernel, so I do not currently see any point in shipping it. Please correct me if I'm wrong. 
References: <1140805207.1158.64.camel@camp4.serpentine.com> Message-ID: <1140806200.1158.74.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 10:20 -0800, Bryan O'Sullivan wrote: > I'm creating a 1.0 branch in the SVN repository today. And here's the URL: https://openib.org/svn/gen2/branches/1.0/ References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> <1140802971.1158.38.camel@camp4.serpentine.com> <1140805708.1158.70.camel@camp4.serpentine.com> Message-ID: On Fri, 24 Feb 2006, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 13:24 -0500, James Lentini wrote: > > > uDAPL should be included. The different providers were developed to > > support 3 different OpenIB configurations: OpenIB verbs (when there > > was no user space CM), OpenIB verbs + IB CM, and OpenIB verbs + RDMA > > CM. I recommend you include the openib_cma provider and the user space > > RDMA cm library. > > OK, thanks. Someone will need to write RPM spec files for these > components. Is this something someone at NetApp can do? I can do it. When do you need the spec file? > > kDAPL does not need to be in the release, although it would be trivial > > to include it. > > My understanding is that kDAPL has been vetoed for inclusion in the > upstream kernel, That's correct. > so I do not currently see any point in shipping it. Ok. From robert.j.woodruff at intel.com Fri Feb 24 10:56:13 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 10:56:13 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803852.1158.46.camel@camp4.serpentine.com> Message-ID: <000501c63973$fbe37d10$6aa1070a@amr.corp.intel.com> Bryan wrote, >I don't have a problem with leaving code in the SVN branch and not >touching it. My point is more that the management diags don't have an >RPM spec file, so if someone doesn't write one, they won't get shipped >in binary form, and hence they won't get tested or used. This applies >to other components, too. 
> References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> <1140802971.1158.38.camel@camp4.serpentine.com> <1140805708.1158.70.camel@camp4.serpentine.com> Message-ID: <43FF56ED.3020707@ichips.intel.com> Bryan O'Sullivan wrote: > My understanding is that kDAPL has been vetoed for inclusion in the > upstream kernel, so I do not currently see any point in shipping it. > Please correct me if I'm wrong. I'm not saying that kDAPL needs to be included in the release, but I don't think that this is the correct criteria to use when determining which modules to include in the OpenIB release. We may very well have kernel modules that will never be merged into the kernel, such as test programs or debugging tools, that may make sense to add to a release. - Sean From bos at pathscale.com Fri Feb 24 11:13:55 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 11:13:55 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com> <1140802971.1158.38.camel@camp4.serpentine.com> <1140805708.1158.70.camel@camp4.serpentine.com> Message-ID: <1140808435.1158.81.camel@camp4.serpentine.com> On Fri, 2006-02-24 at 13:37 -0500, James Lentini wrote: > I can do it. When do you need the spec file? Thanks. The scheduled date for 1.0rc1 is March 6, so if you want udapl in rc1, it would be best the spec files checked in by the end of Tuesday. References: <1140802077.4336.34.camel@hal.voltaire.com> <000201c63969$7f1eebc0$6aa1070a@amr.corp.intel.com> Message-ID: I'll second Woody's suggegstion. Most items are in the SVN because there are interested parties. We should have some criteria for what to remove rather than what to include. Dan On 2/24/06, Bob Woodruff wrote: > Hal wrote, > >Also, what about librdmacm ? libibcm ? 
> > > As far as I can tell, most of the rest of OpenIB userland (libibcm, > > libibat, libibmad, etc) is logically part of OpenSM, > > Perhaps another way to determine what should be in the release > is to take everything that is currently in the trunk (say svn rev 5491), and > then discuss removing stuff that is no longer needed or not > yet ready. > > woody > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Fri Feb 24 11:19:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 24 Feb 2006 13:19:02 -0600 Subject: [openib-general] [PATCH][RFC] CMA automatic port number assignment In-Reply-To: References: Message-ID: <1140808742.11764.36.camel@stevo-desktop> On Fri, 2006-02-24 at 12:55 -0500, James Lentini wrote: > The RDMA CM does not automatically assign a port number to the active > consumer on IB. > > My expectation was that if the active consumer called > > rdma_create_id() > rdma_resolve_addr() > rdma_create_qp() > rdma_resolve_route() > rdma_connect() > > analogous to the BSD sockets sequence > > socket(2) > connect(2) > > the consumer would automatically be assigned a local TCP/UDP/SCTP port > number. > > The simple patch below solves the problem by picking a random > ephemeral port number (where ephemeral is defined as any port > 1024). > > This fix is not perfect. > > The choice of port number does not coordinate with the relevant > transport stack (TCP/UDP/SDP). Given that the passive side CMA code > has the same behavior, this follows an established precedent. > > Also, this method needs to be reviewed for security issues. 
There is
> no attempt to ensure that multiple (src addr, dest addr, dest port)
> triplets do not simultaneously use the same src port or to ensure that
> the same port is not reused before some given timeout (e.g.
> TIME_WAIT). Given that the src port is chosen at random, I would
> expect both of these occurrences to be extremely unlikely.
>
> As a reference, the actual TCP code makes this assignment in
> tcp_v4_hash_connect() using the sysctl_local_port_range array to
> determine the range of ephemeral ports and the
> secure_tcp_port_ephemeral() function to choose a secure point to start
> searching for an available port. Unfortunately neither of these are
> available to module code.
>

I don't think this is going to work for iWARP providers. From an iWARP
perspective, either the CMA needs to track the entire port number space
across all providers (ie replicate what the native stack does), or it
needs to track port spaces per provider, or it needs to query the
provider to get an ephemeral port. The problem is that the provider
might already have that port in use from a previous connection.
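One of the options raised above, tracking a port space per provider, can be sketched roughly as follows. This is a user-space illustration only, not CMA code: struct port_space, port_space_alloc() and port_space_free() are hypothetical names, and rand() stands in for get_random_bytes(). It may also be worth noting that "port | 1024U" in the quoted patch always sets bit 10, so it guarantees a value >= 1024 but can never produce ports whose bit 10 is clear (2048-3071, for instance).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PORT_MIN   1024u                         /* first ephemeral port in this sketch */
#define PORT_MAX   65535u
#define PORT_COUNT (PORT_MAX - PORT_MIN + 1u)

/* Hypothetical per-provider port space: one bit per ephemeral port. */
struct port_space {
	uint8_t in_use[(PORT_COUNT + 7u) / 8u];
};

static void port_space_init(struct port_space *ps)
{
	memset(ps->in_use, 0, sizeof ps->in_use);
}

/* Returns 1 if the port at index idx (port - PORT_MIN) is taken. */
static int port_test(const struct port_space *ps, unsigned int idx)
{
	return (ps->in_use[idx / 8u] >> (idx % 8u)) & 1u;
}

/*
 * Pick a random starting point, then scan forward (wrapping around)
 * for the first free port.  Returns the port in host byte order, or
 * 0 if every port in the space is already in use.
 */
static uint16_t port_space_alloc(struct port_space *ps)
{
	unsigned int start = (unsigned int)rand() % PORT_COUNT;
	unsigned int i;

	for (i = 0; i < PORT_COUNT; i++) {
		unsigned int idx = (start + i) % PORT_COUNT;

		if (!port_test(ps, idx)) {
			ps->in_use[idx / 8u] |= (uint8_t)(1u << (idx % 8u));
			return (uint16_t)(PORT_MIN + idx);
		}
	}
	return 0;
}

/* Release a port previously returned by port_space_alloc(). */
static void port_space_free(struct port_space *ps, uint16_t port)
{
	unsigned int idx;

	assert(port >= PORT_MIN);
	idx = port - PORT_MIN;
	ps->in_use[idx / 8u] &= (uint8_t)~(1u << (idx % 8u));
}
```

A caller would keep one struct port_space per provider (or one per transport type, depending on which guarantee is wanted), convert the result with htons() before storing it in sin_port, and call port_space_free() when the connection is torn down. Nothing here handles TIME_WAIT-style reuse delays or coordination with the native stack.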
> Signed-off-by: James Lentini
>
> Index: core/cma.c
> ===================================================================
> --- core/cma.c	(revision 5489)
> +++ core/cma.c	(working copy)
> @@ -1361,6 +1361,14 @@ err:
>  }
>  EXPORT_SYMBOL(rdma_bind_addr);
>
> +static u16 cma_generate_ephemeral_port(void)
> +{
> +	u16 port;
> +
> +	get_random_bytes(&port, sizeof port);
> +	return cpu_to_be16(port | 1024U);
> +}
> +
>  static void cma_format_hdr(void *hdr, enum rdma_port_space ps,
>  			   struct rdma_route *route)
>  {
> @@ -1371,6 +1379,9 @@ static void cma_format_hdr(void *hdr, en
>  	src4 = (struct sockaddr_in *) &route->addr.src_addr;
>  	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
>
> +	if (!src4->sin_port)
> +		src4->sin_port = cma_generate_ephemeral_port();
> +
>  	switch (ps) {
>  	case RDMA_PS_SDP:
>  		sdp_hdr = hdr;
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From bos at pathscale.com Fri Feb 24 11:20:12 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 24 Feb 2006 11:20:12 -0800
Subject: [openib-general] Suggested components to support in 1.0
In-Reply-To: <43FF56ED.3020707@ichips.intel.com>
References: <000001c63967$05b41320$6aa1070a@amr.corp.intel.com>
	<1140802971.1158.38.camel@camp4.serpentine.com>
	<1140805708.1158.70.camel@camp4.serpentine.com>
	<43FF56ED.3020707@ichips.intel.com>
Message-ID: <1140808812.1158.89.camel@camp4.serpentine.com>

On Fri, 2006-02-24 at 10:56 -0800, Sean Hefty wrote:

> I'm not saying that kDAPL needs to be included in the release, but I don't think
> that this is the correct criteria to use when determining which modules to
> include in the OpenIB release. We may very well have kernel modules that will
> never be merged into the kernel, such as test programs or debugging tools, that
> may make sense to add to a release.

I don't have a problem with shipping kernel components that are not
upstream or not going to be submitted upstream, provided there's a good
reason to have them present and they're not too intrusive.

In the case of kdapl, the nature of the upstream veto means that distros
will probably never pick it up as long as that veto isn't lifted. Since
its entire purpose is, if my understanding is correct, to have other
in-kernel components use it, it's effectively dead code as it stands.
There is thus no practical value to including it, as far as I can see.

References:
Message-ID: <43FF5DD9.6070107@ichips.intel.com>

James Lentini wrote:
> +static u16 cma_generate_ephemeral_port(void)
> +{
> +	u16 port;
> +
> +	get_random_bytes(&port, sizeof port);
> +	return cpu_to_be16(port | 1024U);
> +}

As you mentioned, we should verify that the port number is not in use.

>  static void cma_format_hdr(void *hdr, enum rdma_port_space ps,
>  			   struct rdma_route *route)
>  {
> @@ -1371,6 +1379,9 @@ static void cma_format_hdr(void *hdr, en
>  	src4 = (struct sockaddr_in *) &route->addr.src_addr;
>  	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
>
> +	if (!src4->sin_port)
> +		src4->sin_port = cma_generate_ephemeral_port();

This only sets the port number in the header. It would make more sense
to save the port number in the rdma_cm_id's src_addr.

- Sean

From bos at pathscale.com Fri Feb 24 11:31:34 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 24 Feb 2006 11:31:34 -0800
Subject: [openib-general] Branch and patch management for 1.0
Message-ID: <1140809494.1158.97.camel@camp4.serpentine.com>

All SVN committers are welcome to commit fixes and merge changes to the
1.0 branch. Please keep in mind the need to keep changes to a minimum,
and to focus on correctness.

If you don't have commit permissions, please send patches to both me and
the mailing list, with "[PATCH 1.0]" in the Subject line.

Unless people object, I'd like to merge changes back from the 1.0 branch
to the trunk quite frequently, so the two don't diverge too much.

References: <1140718515.22707.91.camel@trinity.ogc.int>
Message-ID: <43FF5FDF.30600@ichips.intel.com>

Tom Tucker wrote:
> +struct iw_cm_verbs;
>  struct ib_device {
>  	struct device *dma_device;
>
> @@ -840,6 +844,8 @@
>
>  	u32 flags;
>
> +	struct iw_cm_verbs* iwcm;
> +

Does anyone object to adding this to ib_device? I'm not thrilled about
this, but I don't see another alternative, and I'm not sure it's any
worse than having a 'process_mad' function.

Maybe we need a more generic way of providing transport/device specific
extensions? Something like:

	struct ib_device {
		...
		union {
			struct iw_verbs *iw;
			struct ib_verbs *ib;
		} ext_verbs;
		...
	};

Thoughts?

- Sean

From dledford at redhat.com Fri Feb 24 11:41:35 2006
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 24 Feb 2006 14:41:35 -0500
Subject: [openib-general] Towards a 1.0 release of OpenIB
In-Reply-To: <1140711884.17258.50.camel@serpentine.pathscale.com>
References: <1140711884.17258.50.camel@serpentine.pathscale.com>
Message-ID: <20060224194135.GJ5082@redhat.com>

On Thu, Feb 23, 2006 at 08:24:44AM -0800, Bryan O'Sullivan wrote:
> On Thu, 2006-02-23 at 16:58 +0200, Gali Zisman wrote:
>
> > I am a little concerned about the release timeline.
> > It looks like the GA date is May 08. If I remember correctly the SLES 10
> > release date is before that. And it looks a little late for the RH
> > release plan as well.
>
> I understand this concern. There's a natural tension between wanting
> the distro vendors to be able to package and ship something and having
> that something be reasonably functional and somewhat tested.
>
> > We need to make sure that the distributions could use this release.
> > Doug, do you have any comments in this regard?

Unfortunately, I'm not allowed to discuss dates for our releases.
Having said that, you guys do what you feel is best, and I'll find a way to work with it. You get me a strong, stable, complete 1.0 release and I'll move mountains to make sure I'm supporting that instead of something I threw together myself over the course of the 7 year lifetime we promise for RHEL releases ;-) > > Of course I can not comment on their behalf but it is curtail that we > > get their approval to the schedule. > > I don't know who's even paying attention to any of this at Novell/SUSE, > I'm afraid. I'd certainly welcome their input. > > 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From mshefty at ichips.intel.com Fri Feb 24 12:33:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Feb 2006 12:33:03 -0800 Subject: [openib-general] Re: [PATCH] Small change to addr.c for iWARP support In-Reply-To: <1140795513.29942.9.camel@trinity.ogc.int> References: <1140795513.29942.9.camel@trinity.ogc.int> Message-ID: <43FF6D7F.7040206@ichips.intel.com> Tom Tucker wrote: > Two changes to addr.c for iWARP: Thanks. Committed in 5493 with a minor fix up to remove extra parens. - Sean From takshak at gs-lab.com Fri Feb 24 12:46:22 2006 From: takshak at gs-lab.com (Takshak C.) Date: Sat, 25 Feb 2006 02:16:22 +0530 (IST) Subject: [openib-general] Get Table Records for SA Attribute ID ? In-Reply-To: <1140789577.28051.41187.camel@hal.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> <43FC6288.4040402@gs-lab.com> <1140615978.28051.15030.camel@hal.voltaire.com> <43FF059F.4000901@gs-lab.com> <1140789577.28051.41187.camel@hal.voltaire.com> Message-ID: <4439.65.219.193.226.1140813982.squirrel@65.219.193.226> Hi Hal: Thanks. 
Basically, I m using the following flow in my application: status = opensm_init(&options, (osm_log_level_t) log_flags ); // Get some portID where you need to bind this SM if ( guid == 0 && !(guid = get_port_guid(guid, &port_attr))) { printf("\n We can't proceed further without port_guid \n"); exit(0); } status = opensm_bind(&port_attr) ; // The above function internally calls // osmv_bind_sa(...) function of libvendor library. Then, I m calling opensm_saquery()..this function prepares the query request and calls status = osmv_query_sa(openSMObject.h_bind, &req) ; So, I believe, I m not playing with transaction ID here. Thus, I m wondering what could have been happened with transaction ID mismatched. I have attached my application code file along with this mail for your kind reference. Please take a look. I m also trying to figure out what initilization has been done by openSM instance which is not happening through my normal application startup time. One thing, I have observed is as below which might help me to get your comments: 1. I Start openSM instance in one terminal. It says Subnet is up and it is Master. 2. On other terminal when I start my attached application, then openSM terminal says I m processing the MAD ( osm_sa_mad_ctrl.c processing function ). It means that, my application acts as a client and openSM acting as a server to serve the request. I would like to deattached this client and server communication. This is my current goal. If you could point to me what exactly I need to do for this then it would be great help. If you could point me to the documentation then it would also help me to understand the communication between the client and openSM as a server. Thanks and Regards, - Takshak > Hi Takshak, > > On Fri, 2006-02-24 at 08:09, Takshak C. wrote: >> Hi Hal: >> >> Thanks for the information. >> >> I would like to confirm that, umad_send() and umad_recv calls goes out >> of osm libraries. > > Not sure what you mean by this. 
The umad library of which those calls > are a part of is underneath any OSM libraries that might be used. > >> I have written an application to get PATH_RECORDS using opensm libraries >> similar to osmtest. >> If I don't start openSM instance, then I don't get results rather I get >> an error message as >> below: >> [ umad_receiver: ERR 5409: send completed with error (method=0x12 >> attr=0x35 trans_id=0x1) -- dropping ] > > That's saying that a matching receive was not seen (the response to the > GetTable PathRecord request). The transaction ID looks funny to me for > matching. Is this being set correctly ? (Not having the code it is hard > for me to tell how things are initialized and what mode this is working > in). > >> I have tried to go inside. Just after umad_send() call in >> osm_vendor_ibumad.c ( osm_vendor_send() function ), >> I called umad_recv() function call and tried to check the receive >> length. And I am getting 24. >> >> Why this could have been happened ? return length = 24 means call has >> not received path records. > > Because if it is getting a response, due to the transaction ID, I do not > think it is considered a match. I can't be sure with the info provided. > >> But if I start openSM instance then I get proper result length. > > Sounds like your application may have an initialization issue that this > fixes. > >> I would like to remove dependency of starting this openSM instance and >> my application program >> should run independently. > > This should be possible. > >> Could you please throw me some light on this. When I have called >> umad_recv() it should not depend >> on openSM instance I believe to received the things. > > Correct. > > -- Hal > >> Thanks & Regards. >> - Takshak >> >> >> Hal Rosenstock wrote: >> > On Wed, 2006-02-22 at 08:09, Takshak C. wrote: >> > > Thanks a lot Hal, for clearing my doubts. >> > > I would like to redefine my problem based on your inputs. 
>> > > >> > > I am into a scenario, where vendor specific primary SM is running in >> > > the subnet. >> > > This running SM is different than openSM. I have loaded an openIB >> > > stack on the host. >> > >> > OK. I understand your configuration. >> > >> > > Some of the sample examples from management/diags/src/ directory >> like >> > > smpquery >> > > for nodeinfo etc works and gives result to me. >> > > >> > > Now, could it be possible for me to write a SA query and fetch the >> > > path, service >> > > or info records >> > >> > Info records ? >> > >> > > without starting openSM instance as I have already primary SM >> > > running in the subnet. ? >> > >> > Yes; all you are (conceptually) talking about is a user SA client. >> > >> > > I believe, this question could be right and your answer would >> > > help me. >> > > >> > > I do not want to start openSM because then synchronization between >> > > primary SM >> > > and openSM would bring other issues or difficulties. >> > >> > Understood. It was unclear whether you had an SM in your subnet. >> > >> > You should be able to link libopensm and the other management >> libraries >> > to an SA application which would do this (and not require OpenSM >> > itself). >> > >> > > Could you please tell me, how should I go about it ? Waiting. >> > >> > I think I've already answered this. >> > >> > -- Hal >> > >> > > Regards. >> > > - Takshak >> > > >> > > >> > > >> > > Hal Rosenstock wrote: >> > > > On Wed, 2006-02-22 at 06:56, Takshak C. wrote: >> > > > >> > > > > Hal Rosenstock wrote: >> > > > > >> > > > > > > Please throw some light on this. Do you have any userspace >> SA support for retrieving path, service record >> > > > > > > information ? >> > > > > > > >> > > > > > > >> > > > > > >> > > > > > There have been discussions about userspace SA support but >> nothing >> > > > > > currently for OpenIB (gen2). 
Currently, you can get this by >> using >> > > > > > >> > > > > > >> > > > > >> > > > > Could you please tell me, when userspace SA support will be >> available >> > > > > in openIB gen2. >> > > > > >> > > > >> > > > I don't know but I'm not sure how much this helps you based on >> your >> > > > questions below. >> > > > >> > > > >> > > > > > osm_vendor_ibumad_sa.c which supports most SA requests. It is >> built as >> > > > > > part of libosmvendor (part of the OpenSM build) but can be >> used outside >> > > > > > of OpenSM. It is used by osmtest if you want to look at some >> use cases. >> > > > > > It obtains PathRecords and ServiceRecords. That might be an >> easier >> > > > > > direction to go than trying to use the management libraries to >> build the >> > > > > > pieces of a userspace SA client you want. >> > > > > > >> > > > > > -- Hal >> > > > > > >> > > > > > >> > > > > >> > > > > See, to execute osmtest, I found that openSM instance must be >> there. >> > > > > >> > > > >> > > > Must be where ? What is your IB configuration ? >> > > > >> > > > >> > > > > So, even if I use part >> > > > > of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I >> have to >> > > > > start openSM >> > > > > instance to execute the SA query successfully. >> > > > > >> > > > >> > > > An SM is needed in the subnet and SA is part of that and answers >> such >> > > > queries. >> > > > >> > > > >> > > > > Without starting openSM client, I m able to retrieve node >> description, >> > > > > node info, SM info, >> > > > > port info by using management libraries libibumad and libibmad. >> > > > > >> > > > >> > > > of the local node only (until the SM brings up the subnet). >> > > > >> > > > >> > > > > What I want to achieve is, without talking with openSM instance, >> my SA >> > > > > query client >> > > > > should go and get the required information. >> > > > > >> > > > >> > > > Why ? >> > > > >> > > > >> > > > > Is this possible ?. >> > > > > >> > > > >> > > > No. 
What would you query for paths to if the subnet were not up ? >> > > > >> > > > -- Hal >> > > > >> > > > >> > > > > Would like to know your inputs on this. >> > > > > >> > > > > Regards, >> > > > > - Takshak >> > > > > >> > > > > > > Regards. >> > > > > > > - Takshak >> > > > > > > >> > > > > > > >> > > > > > > Hal Rosenstock wrote: >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > Hi, >> > > > > > > > >> > > > > > > > There are a couple of issues with the below. >> > > > > > > > >> > > > > > > > 1. SA MAD structure is missing the RMPP header. Once I saw >> that I didn't check for further issues with the format. >> > > > > > > > >> > > > > > > > 2. I will assume your register call sets RMPP. >> > > > > > > > >> > > > > > > > 3. SA class version is 2. >> > > > > > > > >> > > > > > > > What SM are you using ? If you are using OpenSM, you can >> turn on verbose and see if the packet is seen by the SM. >> You could also enable madeye (in utils) to see if the >> packet is sent (and if anything is received back). >> > > > > > > > >> > > > > > > > -- Hal >> > > > > > > > >> > > > > > > > ________________________________ >> > > > > > > > >> > > > > > > > From: openib-general-bounces at openib.org on behalf of >> Takshak C. >> > > > > > > > Sent: Mon 2/6/2006 8:00 AM >> > > > > > > > To: openib-general at openib.org >> > > > > > > > Subject: [openib-general] Get Table Records for SA >> Attribute ID ? >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > Hi, >> > > > > > > > >> > > > > > > > I m trying to get the table records for SA attribute ID in >> following way. >> > > > > > > > But, I m not getting a single record, could anyone comment >> on the problem. >> > > > > > > > >> > > > > > > > 1. 
I have created saMadFormat structure described in the >> specification as below: >> > > > > > > > >> > > > > > > > struct saMadFormat >> > > > > > > > { >> > > > > > > > >> > > > > > > > uint8_t base_version ; >> > > > > > > > uint8_t mgmt_class ; >> > > > > > > > uint8_t class_version ; >> > > > > > > > uint8_t sa_method ; >> > > > > > > > uint16_t status ; >> > > > > > > > uint16_t not_used ; >> > > > > > > > uint64_t tid ; >> > > > > > > > uint16_t attr_id ; >> > > > > > > > uint16_t resv ; >> > > > > > > > uint32_t attr_mod ; >> > > > > > > > uint64_t sa_key; >> > > > > > > > uint64_t sm_key ; >> > > > > > > > uint32_t seg_num ; >> > > > > > > > uint32_t payload_len ; >> > > > > > > > uint8_t frag_flag ; >> > > > > > > > uint8_t edit_mod ; >> > > > > > > > uint16_t window ; >> > > > > > > > uint32_t endRID ; >> > > > > > > > uint64_t comp_mask ; >> > > > > > > > uint8_t adminData[192] ; >> > > > > > > > }; >> > > > > > > > >> > > > > > > > 2. Then I have done all the basic operations like >> umad_open, umad_register for the IB_SA_CLASS >> > > > > > > > and umad_open_port etc successfully. >> > > > > > > > >> > > > > > > > 3. struct saMadFormat *saQuery = (struct >> saMadFormat*)(umad_get_mad(umad)); >> > > > > > > > memset(saQuery, 0, sizeof(*saQuery)); >> > > > > > > > >> > > > > > > > saQuery->base_version = 1; >> > > > > > > > saQuery->mgmt_class = IB_SA_CLASS ; >> > > > > > > > saQuery->class_version = 1 ; >> > > > > > > > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; >> > > > > > > > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; >> > > > > > > > saQuery->attr_mod = 0 ; >> > > > > > > > saQuery->tid = htonll(drmad_tid++); >> > > > > > > > saQuery->endRID = 0 ; >> > > > > > > > >> > > > > > > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); >> > > > > > > > umad_set_grh(umad, 0); >> > > > > > > > umad_set_pkey(umad, 0xFFFF); >> > > > > > > > >> > > > > > > > 4. 
length = IB_MAD_SIZE; >> > > > > > > > >> > > > > > > > if (umad_send(portid, mad_agent, umad, length, >> timeout_ms, 0) < 0) >> > > > > > > > IBPANIC("send failed"); >> > > > > > > > >> > > > > > > > if (umad_recv(portid, umad, &length, -1) != mad_agent) >> > > > > > > > IBPANIC("recv error: %s", drmad_status_str(saQuery)); >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > if (!dump_char) { >> > > > > > > > xdump(stdout, 0, saQuery->adminData, 192); >> > > > > > > > return 0; >> > > > > > > > } >> > > > > > > > >> > > > > > > > I m expecting that, I will get the resultant data in >> saQuery->adminData. >> > > > > > > > Is this correct ? If not then, how should I retrieve the >> table records ? >> > > > > > > > Any Idea ? >> > > > > > > > >> > > > > > > > >> > > > > > > > Thanks >> > > > > > > > - Takshak >> > > > > > > > >> > > > > > > > _______________________________________________ >> > > > > > > > openib-general mailing list >> > > > > > > > openib-general at openib.org >> > > > > > > > http://openib.org/mailman/listinfo/openib-general >> > > > > > > > >> > > > > > > > To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > >> > > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: openIBsaQuery.c URL: From jlentini at netapp.com Fri Feb 24 13:10:34 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 24 Feb 2006 16:10:34 -0500 (EST) Subject: [openib-general] [PATCH][RFC] CMA automatic port number assignment In-Reply-To: <1140808742.11764.36.camel@stevo-desktop> References: <1140808742.11764.36.camel@stevo-desktop> Message-ID: On Fri, 24 Feb 2006, Steve Wise wrote: > I don't think this is going to work for iWARP providers. 
From an
> iWARP perspective, either the CMA needs to track the entire port
> number space across all providers (ie replicate what the native
> stack does), or it needs to track port spaces per provider, or it
> needs to query the provider to get an ephemeral port. The problem
> is that the provider might already have that port in use from a
> previous connection.

My patch was intended to fix a very specific problem on IB.

Stepping back and looking at the big picture, there are several
different guarantees the CMA could make. When the CMA automatically
generates a port number, does it

1) coordinate with the native transport (TCP/UDP/SCTP) stack?

or

2) coordinate across RDMA transports (IB and iWARP)?

or

3) coordinate across individual devices within a transport type?
   (all IB devices share a port space, all iWARP devices share a
   port space, but IB and iWARP aren't coordinated)

or

4) coordinate on a device by device basis (an mthca device's
   port space would be separate from other mthca devices, ipath
   devices, etc.)

For IB, the CMA can guarantee #3 relatively easily. If this is going
to be an issue for iWARP, it is acceptable for the CMA to only
guarantee #4. Do you agree, Sean?

From hch at lst.de Fri Feb 24 13:16:15 2006
From: hch at lst.de (Christoph Hellwig)
Date: Fri, 24 Feb 2006 22:16:15 +0100
Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator
In-Reply-To: <43FD8DF9.5090200@voltaire.com>
References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com>
	<43FD8DF9.5090200@voltaire.com>
Message-ID: <20060224211615.GA30927@lst.de>

On Thu, Feb 23, 2006 at 12:27:05PM +0200, Or Gerlitz wrote:
> >>I'd say kill the non-SG case. We're in the progress of removing non-SG
> >>commands in the scsi midlayer, and I'm pretty sure they won't exist
> >>anymore before the iser code merged.
> > >I wonder what would be the simplest patch to support it, does it make > >sense to > >use virt_to_page on sc->request_buffer to compose one entry SG on the > fly > >and use it down the code? > > Specifically, does something like this make sense? > > struct scatterlist my_sg; /* somewhere, but iser_send_command stack */ > > if(!sc->use_sg) { > my_sg.page = virt_to_page(sc->request_buffer); > my_sg.length = sc->request_bufflen; > my_sg.offset = 0; > } > > now continue as usual to process my_sg (it can't be on the stack Yes, that makes sense for now. There's even a sg_init_one helper to do the legwork for you. But before iser support goes into mainline that'll be obsolete already and can be removed again. From hch at lst.de Fri Feb 24 13:20:29 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 24 Feb 2006 22:20:29 +0100 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <43FD85C5.1030502@voltaire.com> References: <20060222162903.GC24303@lst.de> <43FD85C5.1030502@voltaire.com> Message-ID: <20060224212029.GB30927@lst.de> On Thu, Feb 23, 2006 at 11:52:05AM +0200, Or Gerlitz wrote: > Can you educate me here a little... basically what i was thinking about > dma mapping is that it maps from kernel virtual address to the bus > address related to the device and SG sent down to a LLD from the > midlayer can be supplied to dma_map_sg. > > Since that was my thought i assumed using page_address(sg->page) is fine > > So what you say here is that there are cases (eg highmem) where > dma_map_sg does not assume such mapping currently exist? nor the LLD can > assume this. dma_map_page/dma_map_sg map from page frames to bus addresses. There is no need for the pages to be mapped into kernel virtual memory at all. E.g.
the simple non-iommu implementation of dma_map_page on i386 does the following: static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction direction) { BUG_ON(direction == DMA_NONE); return page_to_phys(page) + offset; } It doesn't involve kernel virtual addresses at all, just a struct page and its physical address. For more complex schemes the physical address needs to be translated to a bus address, but there's no requirement for the page to be mapped into KVA. For example direct I/O on filesystems or block devices will send down pages not mapped into KVA to the scsi subsystem. From hch at lst.de Fri Feb 24 13:23:48 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 24 Feb 2006 22:23:48 +0100 Subject: [openib-general] [PATCH 6/6] [RFC] iser socket In-Reply-To: <43FDBF83.5010504@voltaire.com> References: <54AD0F12E08D1541B826BE97C98F99F1297203@NT-SJCA-0751.brcm.ad.broadcom.com> <43FDBF83.5010504@voltaire.com> Message-ID: <20060224212348.GC30927@lst.de> On Thu, Feb 23, 2006 at 03:58:27PM +0200, Alexander Nezhinsky wrote: > We don't upgrade iSCSI stream connection but start with > an RDMA connection right away. > The iSER code is going to be one of open-iscsi transports, > and open-iscsi opens connections using sockets from user space, > which is only natural with tcp. > The iSER RC connection should be open from kernel, so this special > socket gives us an opportunity to do so, while leaving intact the > entire mechanism of connection establishment and user-kernel handover. > We don't really need to implement read/write primitives because > they are initiated either from within kernel transport module > itself or through a special user-kernel interface bypassing > the socket. I think the iscsi userspace common code should be fixed to handle that case. If we really need a dummy handle it could at least be a pipe or something else that doesn't require writing a new stub.
(Sorry for being so vague, it's been a while since I last looked at the iscsi code) From mshefty at ichips.intel.com Fri Feb 24 13:25:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Feb 2006 13:25:36 -0800 Subject: [openib-general] [PATCH][RFC] CMA automatic port number assignment In-Reply-To: References: <1140808742.11764.36.camel@stevo-desktop> Message-ID: <43FF79D0.7070900@ichips.intel.com> James Lentini wrote: > When the CMA automatically generates a port number, does it > 1) coordinate with the native transport (TCP/UDP/SCTP) stack? or No - the CMA port spaces are separate from the TCP/UDP/SCTP port spaces for IB. This is why it doesn't bother generating a local port number. Doing so isn't really needed. > 2) coordinate across RDMA transports (IB and iWARP)? or Not really, because iWARP will have true IP port spaces, whereas IB only pretends to have them. There is a sort of coordination across all transports for listening requests. > For IB, the CMA can guarantee #3 relatively easily. If this is going > to be an issue for iWARP, it is acceptable for the CMA to only > guarantee #4. > > Do you agree Sean? The real issue is that the CMA port spaces are not IP port spaces for IB. On the active side of the connection, we can use any number that's convenient. If we want to assign different port numbers to different connections, then the QPN should work fine. What's the underlying issue that you're trying to solve?
- Sean From hch at lst.de Fri Feb 24 13:27:48 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 24 Feb 2006 22:27:48 +0100 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140801374.1158.22.camel@camp4.serpentine.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <20060224212748.GD30927@lst.de> On Fri, Feb 24, 2006 at 09:16:14AM -0800, Bryan O'Sullivan wrote: > Here is a first cut at the set of components (protocols, drivers, > userspace bits) that I think we should be supporting in 1.0. Please > look over it and let me know if I'm missing anything. > > HCA support (both kernel driver and userspace verbs components): > > * ehca > * ipath > * mthca I fear the first two won't be ready in time unless the progress rate increases a lot. > IB protocols: > > * IPoIB > * RC I haven't seen any RC code for the gen2 stack yet and thus highly doubt it'll be ready in time. > * SDP There are various political problems involved here. And besides that the code needs some serious work which is only happening really really slowly. > * SRQ > * UC > * UD What upper layer protocols do you refer to with these acronyms? > Components that I don't know what to do about, and will likely want to > drop unless someone can vouch for them: > > * iSER This has been making some nice progress and could make it into 2.6.17. > * SRP SRP has been in mainline for quite a while and is a really simple protocol. No reason not to support it.
From mshefty at ichips.intel.com Fri Feb 24 13:42:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Feb 2006 13:42:05 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <20060224212748.GD30927@lst.de> References: <1140801374.1158.22.camel@camp4.serpentine.com> <20060224212748.GD30927@lst.de> Message-ID: <43FF7DAD.7000809@ichips.intel.com> Christoph Hellwig wrote: >>IB protocols: >> >> * IPoIB >> * RC > > I haven't seen any RC code for the gen2 stack yet and thus highly doubt > it'll be ready in time. I think he just means reliable connections, which is there. The reliability is handled by the hardware; and the IB CM or RDMA CM establishes the connections. >> * SRQ >> * UC >> * UD > > What upper layer protocols do you refer to with these acronyms? shared receive queues, unreliable connections, and unreliable datagrams - Sean From mlleini at ca.sandia.gov Fri Feb 24 13:43:43 2006 From: mlleini at ca.sandia.gov (Matt L. Leininger) Date: Fri, 24 Feb 2006 13:43:43 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803238.1158.42.camel@camp4.serpentine.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> Message-ID: <1140817424.6119.491.camel@localhost> On Fri, 2006-02-24 at 09:47 -0800, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 12:31 -0500, Hal Rosenstock wrote: > > > I don't understand RC, UC, UD being separately listed here. > > Just completeness. > > > > Userland software: > > > > > > * libibverbs > > > * libsdp > > > * opensm > > > > There are diags as well as OpenSM in the management directory. > > Do you want the diags in, or out? They're not packaged in any way, so > I'd vote for "out". Customers need them "in". - Matt > > > Also, what about librdmacm ? libibcm ? 
> > > > > As far as I can tell, most of the rest of OpenIB userland (libibcm, > > > libibat, libibmad, etc) is logically part of OpenSM, > > > > libibmad (and libibcommon and libibumad) are but libibcm and libibat are > > not. I believe libibat as well as the core at and user_at are > > superseded by addr and libibrdma. > > OK. What kinds of states are the old and new libraries in? I'd rather > not ship a library and its proposed replacement. From mlleini at ca.sandia.gov Fri Feb 24 13:45:57 2006 From: mlleini at ca.sandia.gov (Matt L. Leininger) Date: Fri, 24 Feb 2006 13:45:57 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803852.1158.46.camel@camp4.serpentine.com> References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> <1140803852.1158.46.camel@camp4.serpentine.com> Message-ID: <1140817557.6119.493.camel@localhost> On Fri, 2006-02-24 at 09:57 -0800, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 09:52 -0800, Bob Woodruff wrote: > > Bryan wrote, > > >Do you want the diags in, or out? They're not packaged in any way, so > > >I'd vote for "out". > > > > Again, I would vote for including everything that is in the trunk > > unless the maintainer decides it is not ready or it is some > > code that is now obsolete and should be removed. > > I don't have a problem with leaving code in the SVN branch and not > touching it. My point is more that the management diags don't have an > RPM spec file, so if someone doesn't write one, they won't get shipped > in binary form, and hence they won't get tested or used. This applies > to other components, too. Agreed. Is anyone planning on adding a spec file for the tools and diags?
- Matt From hch at lst.de Fri Feb 24 13:47:39 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 24 Feb 2006 22:47:39 +0100 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <43FF7DAD.7000809@ichips.intel.com> References: <1140801374.1158.22.camel@camp4.serpentine.com> <20060224212748.GD30927@lst.de> <43FF7DAD.7000809@ichips.intel.com> Message-ID: <20060224214738.GA31477@lst.de> On Fri, Feb 24, 2006 at 01:42:05PM -0800, Sean Hefty wrote: > Christoph Hellwig wrote: > >>IB protocols: > >> > >> * IPoIB > >> * RC > > > >I haven't seen any RC code for the gen2 stack yet and thus highly doubt > >it'll be ready in time. > > I think he just means reliable connections, which is there. The > reliability is handled by the hardware; and the IB CM or RDMA CM > establishes the connections. sorry, I was thinking of the silverstorm RDS stuff. But for RC then the same is true as with the other features below that aren't protocols. > >> * SRQ > >> * UC > >> * UD > > > >What upper layer protocols do you refer to with these acronyms? > > shared receive queues, unreliable connections, and unreliable datagrams But that's not IB protocols.. From halr at voltaire.com Fri Feb 24 13:45:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 16:45:24 -0500 Subject: [openib-general] Get Table Records for SA Attribute ID ?
In-Reply-To: <4439.65.219.193.226.1140813982.squirrel@65.219.193.226> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AC20@taurus.voltaire.com> <43F4788E.3070909@gs-lab.com> <1140120738.4333.33149.camel@hal.voltaire.com> <43FC515A.3020404@gs-lab.com> <1140612035.28051.14448.camel@hal.voltaire.com> <43FC6288.4040402@gs-lab.com> <1140615978.28051.15030.camel@hal.voltaire.com> <43FF059F.4000901@gs-lab.com> <1140789577.28051.41187.camel@hal.voltaire.com> <4439.65.219.193.226.1140813982.squirrel@65.219.193.226> Message-ID: <1140817523.4339.131.camel@chrisq-g40.us.voltaire.com> Hi again Takshak, On Fri, 2006-02-24 at 15:46, Takshak C. wrote: > Hi Hal: > > Thanks. Basically, I m using the following flow in my application: > > status = opensm_init(&options, (osm_log_level_t) log_flags ); > // Get some portID where you need to bind this SM > if ( guid == 0 && !(guid = get_port_guid(guid, &port_attr))) > { > printf("\n We can't proceed further without port_guid \n"); > exit(0); > } > status = opensm_bind(&port_attr) ; Why are you calling opensm_init/bind ? That's why it requires OpenSM (not sure which pieces this pulls in). Look at main.c/osmtest.c as to how it initializes (osmtest_init/bind) and you should be on your way... -- Hal > // The above function internally calls > // osmv_bind_sa(...) function of libvendor library. > > Then, I m calling opensm_saquery()..this function prepares > the query request and calls > status = osmv_query_sa(openSMObject.h_bind, &req) ; > > So, I believe, I m not playing with transaction ID here. > Thus, I m wondering what could have been happened with transaction ID > mismatched. I have attached my application code file along with this mail > for your kind reference. Please take a look. > > I m also trying to figure out what initialization has been done by openSM > instance which is not happening through my normal application startup > time. > > One thing, I have observed is as below which might help me to get your > comments: > > 1.
I Start openSM instance in one terminal. It says Subnet is up and it is > Master. > 2. On other terminal when I start my attached application, then openSM > terminal says I m processing the MAD ( osm_sa_mad_ctrl.c processing > function ). > > It means that, my application acts as a client and openSM acting as a > server to serve the request. I would like to deattached this client and > server communication. This is my current goal. > > If you could point to me what exactly I need to do for this then it would > be great help. If you could point me to the documentation then it would > also help me to understand the communication between the client and openSM > as a server. > > Thanks and Regards, > - Takshak > > > > > Hi Takshak, > > > > On Fri, 2006-02-24 at 08:09, Takshak C. wrote: > >> Hi Hal: > >> > >> Thanks for the information. > >> > >> I would like to confirm that, umad_send() and umad_recv calls goes out > >> of osm libraries. > > > > Not sure what you mean by this. The umad library of which those calls > > are a part of is underneath any OSM libraries that might be used. > > > >> I have written an application to get PATH_RECORDS using opensm libraries > >> similar to osmtest. > >> If I don't start openSM instance, then I don't get results rather I get > >> an error message as > >> below: > >> [ umad_receiver: ERR 5409: send completed with error (method=0x12 > >> attr=0x35 trans_id=0x1) -- dropping ] > > > > That's saying that a matching receive was not seen (the response to the > > GetTable PathRecord request). The transaction ID looks funny to me for > > matching. Is this being set correctly ? (Not having the code it is hard > > for me to tell how things are initialized and what mode this is working > > in). > > > >> I have tried to go inside. Just after umad_send() call in > >> osm_vendor_ibumad.c ( osm_vendor_send() function ), > >> I called umad_recv() function call and tried to check the receive > >> length. And I am getting 24. 
> >> > >> Why this could have been happened ? return length = 24 means call has > >> not received path records. > > > > Because if it is getting a response, due to the transaction ID, I do not > > think it is considered a match. I can't be sure with the info provided. > > > >> But if I start openSM instance then I get proper result length. > > > > Sounds like your application may have an initialization issue that this > > fixes. > > > >> I would like to remove dependency of starting this openSM instance and > >> my application program > >> should run independently. > > > > This should be possible. > > > >> Could you please throw me some light on this. When I have called > >> umad_recv() it should not depend > >> on openSM instance I believe to received the things. > > > > Correct. > > > > -- Hal > > > >> Thanks & Regards. > >> - Takshak > >> > >> > >> Hal Rosenstock wrote: > >> > On Wed, 2006-02-22 at 08:09, Takshak C. wrote: > >> > > Thanks a lot Hal, for clearing my doubts. > >> > > I would like to redefine my problem based on your inputs. > >> > > > >> > > I am into a scenario, where vendor specific primary SM is running in > >> > > the subnet. > >> > > This running SM is different than openSM. I have loaded an openIB > >> > > stack on the host. > >> > > >> > OK. I understand your configuration. > >> > > >> > > Some of the sample examples from management/diags/src/ directory > >> like > >> > > smpquery > >> > > for nodeinfo etc works and gives result to me. > >> > > > >> > > Now, could it be possible for me to write a SA query and fetch the > >> > > path, service > >> > > or info records > >> > > >> > Info records ? > >> > > >> > > without starting openSM instance as I have already primary SM > >> > > running in the subnet. ? > >> > > >> > Yes; all you are (conceptually) talking about is a user SA client. > >> > > >> > > I believe, this question could be right and your answer would > >> > > help me. 
> >> > > > >> > > I do not want to start openSM because then synchronization between > >> > > primary SM > >> > > and openSM would bring other issues or difficulties. > >> > > >> > Understood. It was unclear whether you had an SM in your subnet. > >> > > >> > You should be able to link libopensm and the other management > >> libraries > >> > to an SA application which would do this (and not require OpenSM > >> > itself). > >> > > >> > > Could you please tell me, how should I go about it ? Waiting. > >> > > >> > I think I've already answered this. > >> > > >> > -- Hal > >> > > >> > > Regards. > >> > > - Takshak > >> > > > >> > > > >> > > > >> > > Hal Rosenstock wrote: > >> > > > On Wed, 2006-02-22 at 06:56, Takshak C. wrote: > >> > > > > >> > > > > Hal Rosenstock wrote: > >> > > > > > >> > > > > > > Please throw some light on this. Do you have any userspace > >> SA support for retrieving path, service record > >> > > > > > > information ? > >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > There have been discussions about userspace SA support but > >> nothing > >> > > > > > currently for OpenIB (gen2). Currently, you can get this by > >> using > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > Could you please tell me, when userspace SA support will be > >> available > >> > > > > in openIB gen2. > >> > > > > > >> > > > > >> > > > I don't know but I'm not sure how much this helps you based on > >> your > >> > > > questions below. > >> > > > > >> > > > > >> > > > > > osm_vendor_ibumad_sa.c which supports most SA requests. It is > >> built as > >> > > > > > part of libosmvendor (part of the OpenSM build) but can be > >> used outside > >> > > > > > of OpenSM. It is used by osmtest if you want to look at some > >> use cases. > >> > > > > > It obtains PathRecords and ServiceRecords. 
That might be an > >> easier > >> > > > > > direction to go than trying to use the management libraries to > >> build the > >> > > > > > pieces of a userspace SA client you want. > >> > > > > > > >> > > > > > -- Hal > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > See, to execute osmtest, I found that openSM instance must be > >> there. > >> > > > > > >> > > > > >> > > > Must be where ? What is your IB configuration ? > >> > > > > >> > > > > >> > > > > So, even if I use part > >> > > > > of libosmvendor library ( osm_vendor_ibumad_sa.c) functions, I > >> have to > >> > > > > start openSM > >> > > > > instance to execute the SA query successfully. > >> > > > > > >> > > > > >> > > > An SM is needed in the subnet and SA is part of that and answers > >> such > >> > > > queries. > >> > > > > >> > > > > >> > > > > Without starting openSM client, I m able to retrieve node > >> description, > >> > > > > node info, SM info, > >> > > > > port info by using management libraries libibumad and libibmad. > >> > > > > > >> > > > > >> > > > of the local node only (until the SM brings up the subnet). > >> > > > > >> > > > > >> > > > > What I want to achieve is, without talking with openSM instance, > >> my SA > >> > > > > query client > >> > > > > should go and get the required information. > >> > > > > > >> > > > > >> > > > Why ? > >> > > > > >> > > > > >> > > > > Is this possible ?. > >> > > > > > >> > > > > >> > > > No. What would you query for paths to if the subnet were not up ? > >> > > > > >> > > > -- Hal > >> > > > > >> > > > > >> > > > > Would like to know your inputs on this. > >> > > > > > >> > > > > Regards, > >> > > > > - Takshak > >> > > > > > >> > > > > > > Regards. > >> > > > > > > - Takshak > >> > > > > > > > >> > > > > > > > >> > > > > > > Hal Rosenstock wrote: > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > Hi, > >> > > > > > > > > >> > > > > > > > There are a couple of issues with the below. > >> > > > > > > > > >> > > > > > > > 1. 
SA MAD structure is missing the RMPP header. Once I saw > >> that I didn't check for further issues with the format. > >> > > > > > > > > >> > > > > > > > 2. I will assume your register call sets RMPP. > >> > > > > > > > > >> > > > > > > > 3. SA class version is 2. > >> > > > > > > > > >> > > > > > > > What SM are you using ? If you are using OpenSM, you can > >> turn on verbose and see if the packet is seen by the SM. > >> You could also enable madeye (in utils) to see if the > >> packet is sent (and if anything is received back). > >> > > > > > > > > >> > > > > > > > -- Hal > >> > > > > > > > > >> > > > > > > > ________________________________ > >> > > > > > > > > >> > > > > > > > From: openib-general-bounces at openib.org on behalf of > >> Takshak C. > >> > > > > > > > Sent: Mon 2/6/2006 8:00 AM > >> > > > > > > > To: openib-general at openib.org > >> > > > > > > > Subject: [openib-general] Get Table Records for SA > >> Attribute ID ? > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > Hi, > >> > > > > > > > > >> > > > > > > > I m trying to get the table records for SA attribute ID in > >> following way. > >> > > > > > > > But, I m not getting a single record, could anyone comment > >> on the problem. > >> > > > > > > > > >> > > > > > > > 1. 
I have created saMadFormat structure described in the > >> specification as below: > >> > > > > > > > > >> > > > > > > > struct saMadFormat > >> > > > > > > > { > >> > > > > > > > > >> > > > > > > > uint8_t base_version ; > >> > > > > > > > uint8_t mgmt_class ; > >> > > > > > > > uint8_t class_version ; > >> > > > > > > > uint8_t sa_method ; > >> > > > > > > > uint16_t status ; > >> > > > > > > > uint16_t not_used ; > >> > > > > > > > uint64_t tid ; > >> > > > > > > > uint16_t attr_id ; > >> > > > > > > > uint16_t resv ; > >> > > > > > > > uint32_t attr_mod ; > >> > > > > > > > uint64_t sa_key; > >> > > > > > > > uint64_t sm_key ; > >> > > > > > > > uint32_t seg_num ; > >> > > > > > > > uint32_t payload_len ; > >> > > > > > > > uint8_t frag_flag ; > >> > > > > > > > uint8_t edit_mod ; > >> > > > > > > > uint16_t window ; > >> > > > > > > > uint32_t endRID ; > >> > > > > > > > uint64_t comp_mask ; > >> > > > > > > > uint8_t adminData[192] ; > >> > > > > > > > }; > >> > > > > > > > > >> > > > > > > > 2. Then I have done all the basic operations like > >> umad_open, umad_register for the IB_SA_CLASS > >> > > > > > > > and umad_open_port etc successfully. > >> > > > > > > > > >> > > > > > > > 3. 
struct saMadFormat *saQuery = (struct > >> saMadFormat*)(umad_get_mad(umad)); > >> > > > > > > > memset(saQuery, 0, sizeof(*saQuery)); > >> > > > > > > > > >> > > > > > > > saQuery->base_version = 1; > >> > > > > > > > saQuery->mgmt_class = IB_SA_CLASS ; > >> > > > > > > > saQuery->class_version = 1 ; > >> > > > > > > > saQuery->sa_method = IB_MAD_METHOD_GET_TABLE ; > >> > > > > > > > saQuery->attr_id = IB_SA_ATTR_PATHRECORD ; > >> > > > > > > > saQuery->attr_mod = 0 ; > >> > > > > > > > saQuery->tid = htonll(drmad_tid++); > >> > > > > > > > saQuery->endRID = 0 ; > >> > > > > > > > > >> > > > > > > > umad_set_addr(umad, lid, 1, 0, IB_DEFAULT_QP1_QKEY); > >> > > > > > > > umad_set_grh(umad, 0); > >> > > > > > > > umad_set_pkey(umad, 0xFFFF); > >> > > > > > > > > >> > > > > > > > 4. length = IB_MAD_SIZE; > >> > > > > > > > > >> > > > > > > > if (umad_send(portid, mad_agent, umad, length, > >> timeout_ms, 0) < 0) > >> > > > > > > > IBPANIC("send failed"); > >> > > > > > > > > >> > > > > > > > if (umad_recv(portid, umad, &length, -1) != mad_agent) > >> > > > > > > > IBPANIC("recv error: %s", drmad_status_str(saQuery)); > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > if (!dump_char) { > >> > > > > > > > xdump(stdout, 0, saQuery->adminData, 192); > >> > > > > > > > return 0; > >> > > > > > > > } > >> > > > > > > > > >> > > > > > > > I m expecting that, I will get the resultant data in > >> saQuery->adminData. > >> > > > > > > > Is this correct ? If not then, how should I retrieve the > >> table records ? > >> > > > > > > > Any Idea ? 
> >> > > > > > > > > >> > > > > > > > > >> > > > > > > > Thanks > >> > > > > > > > - Takshak > >> > > > > > > > > >> > > > > > > > _______________________________________________ > >> > > > > > > > openib-general mailing list > >> > > > > > > > openib-general at openib.org > >> > > > > > > > http://openib.org/mailman/listinfo/openib-general > >> > > > > > > > > >> > > > > > > > To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > >> > > >> > > > > From tom at opengridcomputing.com Fri Feb 24 14:19:16 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 24 Feb 2006 16:19:16 -0600 Subject: [openib-general] Re: [PATCH] Header file Changes for iWARP Support In-Reply-To: <43FF5FDF.30600@ichips.intel.com> References: <1140718515.22707.91.camel@trinity.ogc.int> <43FF5FDF.30600@ichips.intel.com> Message-ID: <1140819556.29942.38.camel@trinity.ogc.int> On Fri, 2006-02-24 at 11:34 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > +struct iw_cm_verbs; > > struct ib_device { > > struct device *dma_device; > > > > @@ -840,6 +844,8 @@ > > > > u32 flags; > > > > + struct iw_cm_verbs* iwcm; > > + > > Does anyone object to adding this to ib_device? I'm not thrilled about this, > but I don't see another alternative, and I'm not sure it's any worse than having > a 'process_mad' function. > > Maybe we need a more generic way of providing transport/device specific > extensions? Something like: > > struct ib_device { > ... > union { > struct iw_verbs *iw; > struct ib_verbs *ib; > } ext_verbs; I like this... It is consistent with the CMA as well. > ... > }; > > Thoughts? 
> > - Sean From robert.j.woodruff at intel.com Fri Feb 24 14:30:23 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 24 Feb 2006 14:30:23 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140817557.6119.493.camel@localhost> Message-ID: <000001c63991$e7e518f0$6aa1070a@amr.corp.intel.com> Matt wrote, >Agreed. Is anyone planning on adding a spec file for the tools and >diags? > - Matt I think that Doug from Red Hat put the diags and tools in with opensm. Might look at his spec file for that one. I have a spec file that lumps everything into one RPM for usermode, but I think that we are going more towards the separate RPMs for separate components for the release, although I may continue to make the lump-it-all-in-one RPMs for my testing as it makes install/uninstall easier, at least until RC1 RPMs are ready. If anyone else is interested in having these RPMs for early testing of the SVN5492 1.0 branch I will probably put them in my backports-to-2.6.9 branch. woody From rdreier at cisco.com Fri Feb 24 14:36:35 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 14:36:35 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FF4D6E.1020504@scl.ameslab.gov> (Kyle Schochenmaier's message of "Fri, 24 Feb 2006 12:16:14 -0600") References: <43FC9F0D.9040304@scl.ameslab.gov> <79ae2f320602221513l25121359ya4d7fc9e5e4613c0@mail.gmail.com> <43FF34CA.5000709@scl.ameslab.gov> <43FF4D6E.1020504@scl.ameslab.gov> Message-ID: Kyle> ..which causes the second create_qp() to fail. I'm not Kyle> entirely sure how to get any other useful return codes from Kyle> this type of function call.. any ideas? Thanks, that's enough info I think. I'll try to fix this up. - R.
From rdreier at cisco.com Fri Feb 24 14:42:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 14:42:40 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140801374.1158.22.camel@camp4.serpentine.com> (Bryan O'Sullivan's message of "Fri, 24 Feb 2006 09:16:14 -0800") References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: > HCA support (both kernel driver and userspace verbs components): > * ehca > * ipath I would be concerned about putting these in the release, since the kernel side will need to be revised for upstream inclusion. There's no guarantee that the ABI will remain stable, and hence there could be some real compatibility problems. > * SRP SRP is upstream, and is needed to talk to the native IB storage that's on the market. I don't see why we would want to try and strip this out of a release. I've said this before, but I think the simplest way to handle kernel components is to simply say we support vanilla upstream kernels. If your favorite feature isn't upstream, then this should motivate you to get it upstream. - R. From rdreier at cisco.com Fri Feb 24 14:43:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 14:43:50 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140803852.1158.46.camel@camp4.serpentine.com> (Bryan O'Sullivan's message of "Fri, 24 Feb 2006 09:57:32 -0800") References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> <1140803852.1158.46.camel@camp4.serpentine.com> Message-ID: Bryan> I don't have a problem with leaving code in the SVN branch Bryan> and not touching it. My point is more that the management Bryan> diags don't have an RPM spec file, so if someone doesn't Bryan> write one, they won't get shipped in binary form, and hence Bryan> they won't get tested or used. This applies to other Bryan> components, too. I'm not convinced that spec files should be a requirement for the release. 
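Distributors are going to have to revise or write their own spec files anyway, so I'm not sure if it's a good idea to hold things out of the release just because they don't have a spec file. - R. From rdreier at cisco.com Fri Feb 24 14:45:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 14:45:26 -0800 Subject: [openib-general] fmr question In-Reply-To: <1140801747.11764.24.camel@stevo-desktop> (Steve Wise's message of "Fri, 24 Feb 2006 11:22:26 -0600") References: <54AD0F12E08D1541B826BE97C98F99F12973BD@NT-SJCA-0751.brcm.ad.broadcom.com> <1140801747.11764.24.camel@stevo-desktop> Message-ID: Steve> Something similar to the mw_bind semantics should work to Steve> make it more like the iwarp fast-register (i'm not sure Steve> about IB 1.2). A function like ib_bind_mw() to post the Steve> map WR, and then a new completion type to post the results Steve> back to the CQ... Steve> Would we just change ib_map_phys_fmr() to do this or create Steve> a new API function to preserve backwards compatibility? There are pretty big problems with trying to simulate the "register memory through a work queue" operation on current Mellanox HW (hence the current FMR hack). As hardware that supports the work queue stuff becomes available, I think we should just add new APIs (mostly just new work request structures) and leave the old FMR interface for Mellanox HW. - R. From caitlinb at broadcom.com Fri Feb 24 14:56:21 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 24 Feb 2006 14:56:21 -0800 Subject: [openib-general] fmr question Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297502@NT-SJCA-0751.brcm.ad.broadcom.com> Roland Dreier wrote: > Steve> Something similar to the mw_bind semantics should work to > Steve> make it more like the iwarp fast-register (i'm not sure > Steve> about IB 1.2). A function like ib_bind_mw() to post the > Steve> map WR, and then a new completion type to post the results > Steve> back to the CQ... > > Steve> Would we just change ib_map_phys_fmr() to do this or create > Steve> a new API function to preserve backwards compatibility? > > There are pretty big problems with trying to simulate the > "register memory through a work queue" operation on current > Mellanox HW (hence the current FMR hack). As hardware that > supports the work queue stuff becomes available, I think we > should just add new APIs (mostly just new work request > structures) and leave the old FMR interface for Mellanox HW. > > - R. That strikes me as the correct approach. We will eventually have to add device attributes so that ULPs will know which (if any) of the FMR approaches are supported. And I agree that adding new work requests/completions is all that will be required at that later step (beyond the informational attributes). From rdreier at cisco.com Fri Feb 24 15:19:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 15:19:05 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FC9F0D.9040304@scl.ameslab.gov> (Kyle Schochenmaier's message of "Wed, 22 Feb 2006 11:27:41 -0600") References: <43FC9F0D.9040304@scl.ameslab.gov> Message-ID: BTW, what type of HCA & FW version are you using? The limits on QP sge etc depend on the HCA type, and it's easier for me to reproduce your problem if I can match your system. Thanks, Roland From halr at voltaire.com Fri Feb 24 15:46:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 18:46:53 -0500 Subject: [openib-general] Branch and patch management for 1.0 In-Reply-To: <1140809494.1158.97.camel@camp4.serpentine.com> References: <1140809494.1158.97.camel@camp4.serpentine.com> Message-ID: <1140824813.4335.2.camel@hal.voltaire.com> Hi Bryan, On Fri, 2006-02-24 at 14:31, Bryan O'Sullivan wrote: > All SVN committers are welcome to commit fixes and merge changes to the > 1.0 branch. Please keep in mind the need to keep changes to a minimum, > and to focus on correctness. > > If you don't have commit permissions, please send patches to both me and > the mailing list, with "[PATCH 1.0]" in the Subject line. Do we want to do this with changes that are committed too or only important ones ? My next comments may be stating the obvious but... > Unless people object, I'd like to merge changes back from the 1.0 branch > to the trunk quite frequently, Some care will need to be taken with this. For example, spec files may have different versions, etc. > so the two don't diverge too much. They may diverge for other reasons like additional functionality added to the trunk and not to 1.0. -- Hal From bos at pathscale.com Fri Feb 24 17:19:46 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 17:19:46 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <1140830386.2587.20.camel@localhost.localdomain> On Fri, 2006-02-24 at 14:42 -0800, Roland Dreier wrote: > > HCA support (both kernel driver and userspace verbs components): > > > * ehca > > * ipath > > I would be concerned about putting these in the release, since the > kernel side will need to be revised for upstream inclusion. There's > no guarantee that the ABI will remain stable, and hence there could be > some real compatibility problems. What compatibility problems are you worried about? Little or none of the userspace ABI is provided directly by the hardware drivers. > > * SRP > > SRP is upstream, and is needed to talk to the native IB storage that's > on the market. I don't see why we would want to try and strip this > out of a release. Fine.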
> > Steve> Would we just change ib_map_phys_fmr() to do this or create > Steve> a new API function to preserve backwards compatibility? > > There are pretty big problems with trying to simulate the > "register memory through a work queue" operation on current > Mellanox HW (hence the current FMR hack). As hardware that > supports the work queue stuff becomes available, I think we > should just add new APIs (mostly just new work request > structures) and leave the old FMR interface for Mellanox HW. > > - R. That strikes me as the correct approach. We will eventually have to add device attributes so that ULPs will know which (if any) of the FMR approaches are supported. And I agree that adding new work requests/completions is all that will be required at that later step (beyond the informational attributes). From rdreier at cisco.com Fri Feb 24 15:19:05 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Feb 2006 15:19:05 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FC9F0D.9040304@scl.ameslab.gov> (Kyle Schochenmaier's message of "Wed, 22 Feb 2006 11:27:41 -0600") References: <43FC9F0D.9040304@scl.ameslab.gov> Message-ID: BTW, what type of HCA & FW version are you using? The limits on QP sge etc depend on the HCA type, and it's easier for me to reproduce your problem if I can match your system. Thanks, Roland From halr at voltaire.com Fri Feb 24 15:46:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Feb 2006 18:46:53 -0500 Subject: [openib-general] Branch and patch management for 1.0 In-Reply-To: <1140809494.1158.97.camel@camp4.serpentine.com> References: <1140809494.1158.97.camel@camp4.serpentine.com> Message-ID: <1140824813.4335.2.camel@hal.voltaire.com> Hi Bryan, On Fri, 2006-02-24 at 14:31, Bryan O'Sullivan wrote: > All SVN committers are welcome to commit fixes and merge changes to the > 1.0 branch. Please keep in mind the need to keep changes to a minimum, > and to focus on correctness. 
> > If you don't have commit permissions, please send patches to both me and > the mailing list, with "[PATCH 1.0]" in the Subject line. Do we want to do this with changes that are committed too or only important ones ? My next comments may be stating the obvious but... > Unless people object, I'd like to merge changes back from the 1.0 branch > to the trunk quite frequently, Some care will need to be done in terms of this. For example, spec files may have different versions, etc. > so the two don't diverge too much. They may diverge for other reasons like additional functionality added to the trunk and not to 1.0. -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bos at pathscale.com Fri Feb 24 17:19:46 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 24 Feb 2006 17:19:46 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <1140830386.2587.20.camel@localhost.localdomain> On Fri, 2006-02-24 at 14:42 -0800, Roland Dreier wrote: > > HCA support (both kernel driver and userspace verbs components): > > > * ehca > > * ipath > > I would be concerned about putting these in the release, since the > kernel side will need to be revised for upstream inclusion. There's > no guarantee that the ABI will remain stable, and hence there could be > some real compatibility problems. What compatibility problems are you worried about? Little or none of the userspace ABI is provided directly by the hardware drivers. > > * SRP > > SRP is upstream, and is needed to talk to the native IB storage that's > on the market. I don't see why we would want to try and strip this > out of a release. Fine. 
References: <1140801374.1158.22.camel@camp4.serpentine.com> <20060224212748.GD30927@lst.de> Message-ID: <1140830703.2587.26.camel@localhost.localdomain> On Fri, 2006-02-24 at 22:27 +0100, Christoph Hellwig wrote: > > HCA support (both kernel driver and userspace verbs components): > > > > * ehca > > * ipath > > * mthca > > I fear the first two won't be ready in time unless the progress rate > increases a lot. I have the bulk of the work done for a resubmission of the ipath driver to the upstream kernel. I can't speak for ehca, of course. > > * SDP > > There's various political problems involved here. Pardon my ignorance, but what kinds? (Bryan O'Sullivan's message of "Fri, 24 Feb 2006 17:19:46 -0800") References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140830386.2587.20.camel@localhost.localdomain> Message-ID: Bryan> What compatibility problems are you worried about? Little Bryan> or none of the userspace ABI is provided directly by the Bryan> hardware drivers. There is all the "driver-specific" stuff in the various verbs operations. And of course ipath has all the stuff for your proprietary transport -- my understanding is that all those ioctls are going to be radically changed soon. - R. From mlleinin at hpcn.ca.sandia.gov Fri Feb 24 18:25:04 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Fri, 24 Feb 2006 18:25:04 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140830703.2587.26.camel@localhost.localdomain> References: <1140801374.1158.22.camel@camp4.serpentine.com> <20060224212748.GD30927@lst.de> <1140830703.2587.26.camel@localhost.localdomain> Message-ID: <1140834304.6119.528.camel@localhost> On Fri, 2006-02-24 at 17:25 -0800, Bryan O'Sullivan wrote: > On Fri, 2006-02-24 at 22:27 +0100, Christoph Hellwig wrote: > > > * SDP > > > > There's various political problems involved here. > > Pardon my ignorance, but what kinds? > MS claims that an SDP implementation *may* use some of their IP.
Of course they don't tell you which patents. The task of deciding what risk is associated with SDP (licensing, etc.) is left as an exercise to the reader. Various individuals and companies on this list have looked into the SDP licensing issue, but I haven't seen much (needed) discussion lately. - Matt From mst at mellanox.co.il Sat Feb 25 09:03:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 25 Feb 2006 19:03:17 +0200 Subject: [openib-general] Re: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <43FDF106.8070403@ichips.intel.com> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> Message-ID: <20060225170316.GA15973@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID > > Jack Morgenstein wrote: > >Regarding RMPP abort processing, I see that there is a problem: the code > >assumes that all aborts are received by the responder: > > Correct - unfortunately, I don't believe that there's any way to know if an > abort is for an RMPP message that is being sent, versus received. I.e. > host A can send an RMPP message to host B with TID=3 at the same time host > B sends an RMPP message to host A with TID=3. If host B sends an abort, > host A has no idea which transaction is being aborted. > > - Sean I think we can use the method field for this. C15-0.1.18: ... . The method used for all packets sent from the Receiver to the Sender shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending on which initiated the transfer. . The method used for all packets sent from the Sender to the Receiver shall be SubnAdmGetTableResp(). ... -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From Sujal at Mellanox.com Sat Feb 25 20:05:03 2006 From: Sujal at Mellanox.com (Sujal Das) Date: Sat, 25 Feb 2006 20:05:03 -0800 Subject: [openib-general] Suggested components to support in 1.0 Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA06666145@MTIEX01> Will the release address the following test/support scenarios pertaining to the specific Open IB components being discussed? - OS distro and kernel versions to be tested with - RHAs, SLES, SUSE Pro, Fedora etc..., kernel 2.6.13, 14.... etc - CPU and chipset platforms - Variations on each HCA hardware. For example - with Mellanox, we have DDR, SDR, Mem-free etc. - Testing with various switches available - Testing on large clusters - 128 nodes and beyond ... - Testing for performance/features with ISV / user level apps - Fluent, Oracle, DB2, LSDyna, MPI etc - Testing with available storage targets - both gateways and native, SRP and iSER? From bos at pathscale.com Sat Feb 25 20:35:06 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Sat, 25 Feb 2006 20:35:06 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <25AE7F432672D511B8DC00B0D0DF11DA06666145@MTIEX01> References: <25AE7F432672D511B8DC00B0D0DF11DA06666145@MTIEX01> Message-ID: <1140928507.9852.68.camel@localhost.localdomain> On Sat, 2006-02-25 at 20:05 -0800, Sujal Das wrote: > - OS distro and kernel versions to be tested with - RHAs, SLES, SUSE > Pro, Fedora etc..., kernel 2.6.13, 14.... etc We'll have to see how many of those are feasible. Packaging userspace for the different distros is easy enough; the big problem is backporting kernel support to kernels that will actually work on the distros in question, and building binary packages of those. I think the only feasible approach will be for people to build binary packages of whatever kernels they can and make them available.
People who test can either use those, or build their own kernels and report the kernel versions they are using when they send in test results. > - CPU and chipset platforms I know there's a need for good coverage on the following: * x86_64 * i386 * powerpc (aka ppc64) I assume ia64 needs to be included, too, but I would very much like people to let me know what platforms they want to see tested or can test themselves. > - Variations on each HCA hardware. For example - with Mellanox, we > have DDR, SDR, Mem-free etc. I don't have a comprehensive list of the HCAs vendors plan to test. This would be very useful to have. > - Testing with various switches available Likewise :-) > - Testing on large clusters - 128 nodes and beyond ... > > - Testing for performance/features with ISV / user level apps apps - > Fluent, Oracle, DB2, LSDyna, MPI etc For publicly available software, there's obviously no problem with this. At least some ISVs are known for placing restrictions on the publication of performance results, so while I'd like to see numbers, please be careful in what you choose to report. > - Testing with available storage targets - both gateways and native, > SRP and iSER? Yep. It would be good to have a common set of features and applications that vendors could test in a uniform way, so that we have at least a base set of somewhat standardised test results. In addition, any further testing that people can perform and report on will be most welcome. References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> Message-ID: <200602260932.31156.jackm@mellanox.co.il> On Friday 24 February 2006 19:52, Sean Hefty wrote: > What I believe will happen is that one of the responses will be reassembled > and returned to the user. On the SA side, all ACKs will match with the > first response, and the second response will time out, never getting past > the first segment. 
I think, actually, that there is a race condition on the SA side once the first receiver-requested retry occurs (i.e., the first time a single packet is dropped in transmission from sender to receiver: Sender Rcvr ---------- ------ Send seg1 compl S1, to wait list Ack1 Send S2 - S65 Wait list for Ack Drop, say S35 timeout, retry query Send S'1 (new response session) compl S'1, to wait list (after Session S wait) see as duplicate, Ack34 S gets Ack, continues with S35 Compl S35 to wait list (AFTER Session S') continue until S65 Ack 65 S' Gets the ACK this time, interprets as error (ack-ing unsent segment) aborts the transfer with error IB_MGMT_RMPP_STATUS_W2S See process_rmpp_ack(). Aborts S1 times out, retries a few times. No damage actually done, except that the transfer has failed -- and this can occur if ANY packet gets dropped during the transfer. RMPP thus will not be too reliable. -- Jack From jackm at mellanox.co.il Sun Feb 26 01:41:59 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 26 Feb 2006 11:41:59 +0200 Subject: [openib-general] [PATCH] mthca: implement query_ah for MADs and memfree -- REMINDER Message-ID: <200602261142.03037.jackm@mellanox.co.il> Just a reminder: I submitted the patch on Feb 13. Implement query_ah in provider layer (except for av's which are in HCA memory) Needed for implementing RMPP duplicate session detection on sending side (extraction of DGID/DLID and GRH flag from address handle). 
Signed-off-by: Jack Morgenstein

Index: src/drivers/infiniband/hw/mthca/mthca_av.c
===================================================================
--- src.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-22 09:45:11.621141000 +0200
+++ src/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-22 09:48:39.762130000 +0200
@@ -191,6 +191,34 @@ int mthca_read_ah(struct mthca_dev *dev,
 	return 0;
 }
 
+int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah,
+		   struct ib_ah_attr *ah_attr)
+{
+	/* Only implement for MAD and memfree ah for now. */
+	if (ah->type == MTHCA_AH_ON_HCA)
+		return -ENOSYS;
+
+	memset(ah_attr, 0, sizeof *ah_attr);
+	ah_attr->dlid = be16_to_cpu(ah->av->dlid);
+	ah_attr->sl = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28;
+	ah_attr->static_rate = ah->av->msg_sr & 0x7;
+	ah_attr->src_path_bits = ah->av->g_slid & 0x7F;
+	ah_attr->port_num = be32_to_cpu(ah->av->port_pd) >> 24;
+	ah_attr->ah_flags = mthca_ah_grh_present(ah) ? IB_AH_GRH : 0;
+
+	if (ah_attr->ah_flags) {
+		ah_attr->grh.traffic_class =
+			be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20;
+		ah_attr->grh.flow_label =
+			be32_to_cpu(ah->av->sl_tclass_flowlabel) & 0xfffff;
+		ah_attr->grh.hop_limit = ah->av->hop_limit;
+		ah_attr->grh.sgid_index = ah->av->gid_index &
+			(dev->limits.gid_table_len - 1);
+		memcpy(ah_attr->grh.dgid.raw, ah->av->dgid, 16);
+	}
+	return 0;
+}
+
 int __devinit mthca_init_av_table(struct mthca_dev *dev)
 {
 	int err;
Index: src/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-22 09:46:05.946574000 +0200
+++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-22 09:48:39.781132000 +0200
@@ -532,6 +532,8 @@ int mthca_create_ah(struct mthca_dev *de
 		    struct mthca_pd *pd,
 		    struct ib_ah_attr *ah_attr,
 		    struct mthca_ah *ah);
+int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah,
+		   struct ib_ah_attr *ah_attr);
 int mthca_destroy_ah(struct mthca_dev *dev, struct
mthca_ah *ah); int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header); Index: src/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-22 09:46:06.020575000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-22 09:48:39.803132000 +0200 @@ -446,6 +446,11 @@ static int mthca_ah_destroy(struct ib_ah return 0; } +static int mthca_ah_query(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return mthca_query_ah(to_mdev(ah->device), to_mah(ah), ah_attr); +} + static struct ib_srq *mthca_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *init_attr, struct ib_udata *udata) @@ -1290,6 +1295,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.query_ah = mthca_ah_query; if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; From monil at voltaire.com Sun Feb 26 04:51:04 2006 From: monil at voltaire.com (Moni Levy) Date: Sun, 26 Feb 2006 14:51:04 +0200 Subject: [openib-general] Re: OpenIb 1.0 release components In-Reply-To: <20060224152018.GA16807@mellanox.co.il> References: <20060223170325.GB19426@mellanox.co.il> <1140714888.17258.60.camel@serpentine.pathscale.com> <6a122cc00602231007y508c6c1flac8e451de11344a0@mail.gmail.com> <20060224152018.GA16807@mellanox.co.il> Message-ID: <6a122cc00602260451hf55d012o8d0bbc98865556f5@mail.gmail.com> On 2/24/06, Michael S. Tsirkin wrote: > Quoting Moni Levy : > > > > While this might be a good idea for modules such as iSER > > > > which are not currently part of the mainline kernel tree, > > > > it is in my opinion clearly not a good idea to replace the > > > > modules which *are* distributed with the mainline kernel. > > > > > > I agree, for the most part. 
> > > > > > What I have in mind for non-upstream kernel support is this: > > > > > > * We have to ship out-of-tree drivers, simply because there's only > > > one driver in the upstream kernel, and the others are not yet > > > ready for submission. > > > * Some kernel components are clearly not contenders for shipping. > > > One example is kdapl, because it appears to be dead due to > > > upstream veto. > > > * Others might be reasonable, if they (a) see some testing and (b) > > > don't intrusively patch the core kernel. I'm thinking here > > > about iSER and, to a lesser extent, SDP. > > > > I would like to add another point also. It looks like that in this > > round of the major distribution releases they will just not be able to > > include the 1.0 release due to time constraints, so the only way to > > use 1.0 release (or newer) will be to replace them in the kernel. > > > > Moni > > I dont really understand this last point. What do you mean when you say > "replace them in kernel"? Replace what? Is there an option that the distros would like to get more stable code that is not in kernel.org (yet) ? > > I understand it why you might want to add out of kernel modules such as iSER. > My point is they must work with core components included in kernel, not > with core out of the svn tree. Now I understand. > > I gather Brian here agrees. > > -- > Michael S. 
Tsirkin > Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Sun Feb 26 05:01:30 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 26 Feb 2006 15:01:30 +0200 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <20060224211615.GA30927@lst.de> References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com> <43FD8DF9.5090200@voltaire.com> <20060224211615.GA30927@lst.de> Message-ID: <4401A6AA.3080508@voltaire.com> Christoph Hellwig wrote: >> Specifically, does something like this make sense? >> struct scatterlist my_sg; /* somewhere, but iser_send_command stack */ >> >> if(!sc->use_sg) { >> my_sg.page = virt_to_page(sc->request_buffer); >> my_sg.length = sc->request_bufflen; >> my_sg.offset = 0; >> } >> >> now continue as usual to process my_sg (it can't be on the stack) > Yes, that makes sense for now. There's even a sg_init_one helper > to do the legwork for you. But before iser support goes into mainline > that'll be obsolete already and can be removed again. Thanks for pointing me to sg_init_one (which saves me from my buggy code suggestion above which always sets zero offset...). So the iser code will be cleaned in two phases with respect to "single sc": +1 assume everywhere the sc has SG and just SG-fy the singles. +2 remove the SG-fy trick and maintain it as a patch for backports Or.
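The offset problem mentioned in the message above can be illustrated outside the kernel. What follows is only a minimal userspace sketch, not the kernel's `struct scatterlist` or `sg_init_one`: the names `toy_sg`, `toy_sg_init_one` and `TOY_PAGE_SIZE` are hypothetical, and a 4096-byte page is assumed. It shows why a single buffer wrapped into a one-entry scatterlist needs its in-page offset computed from the address rather than hard-coded to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this model; real kernels use PAGE_SIZE. */
#define TOY_PAGE_SIZE 4096u

/* Hypothetical stand-in for one scatterlist entry (illustration only). */
struct toy_sg {
    uintptr_t page_base;   /* address of the page containing the buffer */
    unsigned int offset;   /* byte offset of the buffer within that page */
    unsigned int length;   /* buffer length in bytes */
};

/*
 * Wrap a single contiguous buffer into a one-entry "scatterlist".
 * The offset is derived from the buffer address; setting it to zero
 * unconditionally would be wrong for any buffer that does not start
 * on a page boundary.
 */
void toy_sg_init_one(struct toy_sg *sg, const void *buf, unsigned int buflen)
{
    uintptr_t addr = (uintptr_t)buf;

    assert(sg != NULL);
    sg->page_base = addr & ~(uintptr_t)(TOY_PAGE_SIZE - 1);
    sg->offset = (unsigned int)(addr & (TOY_PAGE_SIZE - 1));
    sg->length = buflen;
}
```

With an entry built this way, the code path that already handles commands carrying an SG list can be reused for single-buffer commands, which is the "SG-fy the singles" step described above.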
From ogerlitz at voltaire.com Sun Feb 26 05:04:43 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 26 Feb 2006 15:04:43 +0200 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <20060224212029.GB30927@lst.de> References: <20060222162903.GC24303@lst.de> <43FD85C5.1030502@voltaire.com> <20060224212029.GB30927@lst.de> Message-ID: <4401A76B.5080406@voltaire.com> Christoph Hellwig wrote: > dma_map_page/dma_map_sg map from page frames to bus addresses. There > is no need for the pages to mapped into kernel virtual memory at all. > E.g. the simple non-iommu implementation of dma_map_page on i386 does > the following: > > static inline dma_addr_t > dma_map_page(struct device *dev, struct page *page, unsigned long offset, > size_t size, enum dma_data_direction direction) > { > BUG_ON(direction == DMA_NONE); > return page_to_phys(page) + offset; > } > > it doesn't involve kernel virtual addresses at all, just a struct page > and it's physical address. For more complex schemes the physical > address needs to be translated to a bus address, but there's not > requirement for the page to be mapped into kva. For example direct I/O > on filesystems or block devices will send down pages not mapped into KVA > to the scsi subsystem. thanks a lot for taking the time and putting this detailed explanation, now i understand it much better. Or. From mst at mellanox.co.il Sun Feb 26 08:13:28 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 26 Feb 2006 18:13:28 +0200 Subject: [openib-general] Re: Re: OpenIb 1.0 release components In-Reply-To: <6a122cc00602260451hf55d012o8d0bbc98865556f5@mail.gmail.com> References: <20060223170325.GB19426@mellanox.co.il> <1140714888.17258.60.camel@serpentine.pathscale.com> <6a122cc00602231007y508c6c1flac8e451de11344a0@mail.gmail.com> <20060224152018.GA16807@mellanox.co.il> <6a122cc00602260451hf55d012o8d0bbc98865556f5@mail.gmail.com> Message-ID: <20060226161328.GL19855@mellanox.co.il> Quoting Moni Levy : > > > I would like to add another point also. It looks like that in this > > > round of the major distribution releases they will just not be able to > > > include the 1.0 release due to time constraints, so the only way to > > > use 1.0 release (or newer) will be to replace them in the kernel. > > > > > > Moni > > > > I dont really understand this last point. What do you mean when you say > > "replace them in kernel"? Replace what? > > Is there an option that the distros would like to get more stable code > that is not in kernel.org (yet) ? There's always pressure on distros to ship stuff that's not upstream yet. The IB driver stack is still in its early stages, so the unfortunate fact is that lots of things that people want to use are not upstream and are not ready to go upstream yet. We are all working on merging things upstream as fast as possible, so there is no need to bypass the kernel review/testing process. In this way the in-kernel version will become stable. Most importantly as I see it, the focus for the release should be the userspace components, testing them against mainline and possibly other kernels. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Sun Feb 26 14:38:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 26 Feb 2006 14:38:40 -0800 Subject: [openib-general] Re: [PATCH] mthca - command interface In-Reply-To: <1140432784.4429.14.camel@mtls03.yok.mtl.com> (Eli Cohen's message of "Mon, 20 Feb 2006 12:53:04 +0200") References: <1140432784.4429.14.camel@mtls03.yok.mtl.com> Message-ID: Thanks, I applied this to svn and queued it for 2.6.17, with a few changes: - Fix initialization of dev->cmd.flags to zero all flags - Fix ioremapping of UAR0 to work on ppc64, etc (where HCA's bus address view doesn't match token that ioremap needs) - Change module parameter name to fw_cmd_doorbell - My usual anal cleanups and style changes - R. From sashak at voltaire.com Sun Feb 26 15:14:07 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Feb 2006 01:14:07 +0200 Subject: [openib-general] [PATCH] opensm: check port capabilities to join MC group. Message-ID: <20060226231407.GB8819@sashak.voltaire.com> Hello, As recently discussed on the list OpenSM does not check port's physical MTU and rate during MC group joining process (only requested values). This patch adds such checking. Sasha. This adds verification of endport physical capability to join MC group. 
Signed-off-by: Sasha Khapyorsky

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 8f785ff..1003916 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -677,6 +677,57 @@ __validate_more_comp_fields(
   return TRUE;
 }
 
+/*********************************************************************
+In joining an existing group, we make sure the following components
+are physically realizable: MTU and RATE
+**********************************************************************/
+static boolean_t
+__validate_port_caps(
+  osm_log_t * const p_log,
+  const osm_mgrp_t *p_mgrp,
+  const osm_physp_t *p_physp)
+{
+  ib_port_info_t *p_pi;
+  uint8_t mtu_required;
+  uint8_t mtu_mgrp;
+  uint8_t rate_required;
+  uint8_t rate_mgrp;
+
+  p_pi = osm_physp_get_port_info_ptr(p_physp);
+  if (!p_pi)
+  {
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "__validate_port_caps: "
+             "Cannot get Port's 0x%016" PRIx64 " PortInfo\n",
+             osm_physp_get_port_guid(p_physp));
+    return FALSE;
+  }
+
+  mtu_required = ib_port_info_get_mtu_cap(p_pi);
+  mtu_mgrp = (uint8_t)(p_mgrp->mcmember_rec.mtu & 0x3F);
+  if (mtu_required < mtu_mgrp)
+  {
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "__validate_port_caps: "
+             "Port's MTU %x is less than %x\n",
+             mtu_required, mtu_mgrp);
+    return FALSE;
+  }
+
+  rate_required = ib_port_info_compute_rate(p_pi);
+  rate_mgrp = (uint8_t)(p_mgrp->mcmember_rec.rate & 0x3F);
+  if (rate_required < rate_mgrp)
+  {
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "__validate_port_caps: "
+             "Port's RATE %x is less than %x\n",
+             rate_required, rate_mgrp);
+    return FALSE;
+  }
+
+  return TRUE;
+}
+
 /**********************************************************************
  * o15-0.2.1: If SA supports UD multicast, then if SA receives a SubnAdmSet()
  * or SubnAdmDelete() method that would modify an existing
@@ -1641,13 +1692,17 @@ osm_mcmr_rcv_join_mgrp(
       p_rcv->p_log,
       p_mgrp,
       p_recvd_mcmember_rec,
-      p_sa_mad->comp_mask) && (join_state != 0);
+
p_sa_mad->comp_mask) && __validate_port_caps( + p_rcv->p_log, + p_mgrp, + p_physp) && (join_state != 0); if(!valid) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_mcmr_rcv_join_mgrp: ERR 1B12: " - "__validate_more_comp_fields or JoinState = 0 failed, " + "__validate_more_comp_fields, __validate_port_caps " + "or JoinState = 0 failed, " "sending IB_SA_MAD_STATUS_REQ_INVALID\n"); /* since we might have created the new group we need to cleanup */ From sashak at voltaire.com Sun Feb 26 15:20:49 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Feb 2006 01:20:49 +0200 Subject: [openib-general] [RFC] [PATCH] OpenSM: Add functional partition manager support In-Reply-To: <20060219163843.GB16012@sashak.voltaire.com> References: <20060219163843.GB16012@sashak.voltaire.com> Message-ID: <20060226232049.GC8819@sashak.voltaire.com> On 18:38 Sun 19 Feb , Sasha Khapyorsky wrote: > > There is phase 1 of partiton manager for OpenSM. Please review. Some selfreview addition: main thing is mtu and rate flags for IPoIB partitions config. This patch is incremental against previous one. Sasha. This adds possibility to specify in partition configuration mtu and rate values (and overwrite defaults) for partitions which suports IPoIB MC group. Also small fix in pkey auto-generation code. Signed-off-by: Sasha Khapyorsky --- osm/doc/partition-config.txt | 18 +++++++++++---- osm/opensm/osm_prtn.c | 15 +++++++------ osm/opensm/osm_prtn_config.c | 50 ++++++++++++++++++++++++++++++++---------- 3 files changed, 59 insertions(+), 24 deletions(-) diff --git a/osm/doc/partition-config.txt b/osm/doc/partition-config.txt index b3ba804..a5d0fd0 100644 --- a/osm/doc/partition-config.txt +++ b/osm/doc/partition-config.txt @@ -32,15 +32,23 @@ General file format: Partition Definition: -------------------- -[PartitionName][=PKey][,flag] +[PartitionName][=PKey][,flag[=value]] PartitionName - free string, will be used with logging. When omitted empty string will be used. 
PKey - P_Key value for this partition. Only low 15 bits will be used. When omitted will be autogenerated. flag - used to indicate IPoIB capability of this partition. - 'ipoib' is only valid value currently (in future other - values may be added). + +Currently recognized flags are: + +ipoib - indicates that this partition may be used for IPoIB, as + result IPoIB capable MC group will be created. +rate= - specifies rate for this IPoIB MC group (default is 3 (10Bps)) +mtu= - specifies MTU for this IPoIB MC group (default is 4 (2048)) + +Note that values for 'rate' and 'mtu' should be specified as defined in +IBTA specification (for example mtu=4 for 2048). PortGUIDs list: @@ -49,7 +57,7 @@ PortGUIDs list: [PortGUID[=full|=part]] [,PortGUID[=full|=part]] [,PortGUID] ... PortGUID - GUID of partition member EndPort. Hexadecimal numbers - should start from 0x. + should start from 0x, decimal numbers are accepted too. full or part - indicates full or partial membership for this port. When omitted (or unrecognized) partial membership is assumed. @@ -71,7 +79,7 @@ between. PartitionName does not need to be unique, PKey does need to be unique. If PKey is repeated then those partition configurations will be merged -(see also next note). +and first PartitionName will be used (see also next note). 
It is possible to split partition configuration in more than one definition, but then PKey should be explicitly specified (overwise diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c index f5f3a32..c53d849 100644 --- a/osm/opensm/osm_prtn.c +++ b/osm/opensm/osm_prtn.c @@ -170,7 +170,7 @@ ib_api_status_t osm_prtn_add_all(osm_log ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, - osm_subn_t *p_subn, osm_prtn_t *p) + osm_subn_t *p_subn, osm_prtn_t *p, uint8_t rate, uint8_t mtu) { ib_member_rec_t mc_rec; ib_net64_t comp_mask; @@ -187,17 +187,17 @@ ib_api_status_t osm_prtn_add_mcgroup(osm cl_memcpy(&mc_rec.mgid.raw[4], &pkey, sizeof(pkey)); mc_rec.qkey = CL_HTON32(0x0b1b); - mc_rec.mtu = 4; /* 2048 Bytes */ + mc_rec.mtu = mtu ? mtu : 4; /* 2048 Bytes */ mc_rec.tclass = 0; mc_rec.pkey = pkey; - mc_rec.rate = 0x3; /* 10Gb/sec */ + mc_rec.rate = rate ? rate : 0x3; /* 10Gb/sec */ mc_rec.pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; mc_rec.sl_flow_hop = OSM_DEFAULT_SL << 28; /* Note: scope needs to be consistent with MGID */ mc_rec.scope_state = 0x21; - /* mtu and rate will be updated according to CA */ - comp_mask = 0; + /* don't update rate, mtu */ + comp_mask = IB_MCR_COMPMASK_MTU | IB_MCR_COMPMASK_RATE; status = osm_mcmr_rcv_find_or_create_new_mgrp(&p_sa->mcmr_rcv, comp_mask, &mc_rec, &p_mgrp); @@ -216,10 +216,11 @@ static uint16_t __generate_pkey(osm_subn { uint16_t pkey; cl_qmap_t *m = &p_subn->prtn_pkey_tbl; - while ( global_pkey_counter < IB_DEFAULT_PARTIAL_PKEY - 1) { + while ( global_pkey_counter < cl_ntoh16(IB_DEFAULT_PARTIAL_PKEY) - 1) { pkey = ++global_pkey_counter; + pkey = cl_hton16(pkey); if (cl_qmap_get(m, pkey) == cl_qmap_end(m)) - return cl_hton16(pkey); + return pkey; } return 0; } diff --git a/osm/opensm/osm_prtn_config.c b/osm/opensm/osm_prtn_config.c index 97a835a..d818e84 100644 --- a/osm/opensm/osm_prtn_config.c +++ b/osm/opensm/osm_prtn_config.c @@ -83,6 +83,7 @@ struct part_conf { osm_log_t *p_log; osm_subn_t *p_subn; osm_prtn_t *p_prtn; + 
unsigned is_ipoib, mtu, rate; }; @@ -94,8 +95,7 @@ extern ib_api_status_t osm_prtn_add_port osm_subn_t *p_subn, osm_prtn_t *p, ib_net64_t guid, boolean_t full); extern ib_api_status_t osm_prtn_add_mcgroup(osm_log_t *p_log, - osm_subn_t *p_subn, osm_prtn_t *p); - + osm_subn_t *p_subn, osm_prtn_t *p, uint8_t rate, uint8_t mtu); static int partition_create(unsigned lineno, struct part_conf *conf, char *name, char *id, char *flag, char *flag_val) @@ -120,18 +120,39 @@ static int partition_create(unsigned lin name, cl_hton16(pkey)); if (!conf->p_prtn) return -1; + + if (conf->is_ipoib) + osm_prtn_add_mcgroup(conf->p_log, conf->p_subn, + conf->p_prtn, conf->rate, conf->mtu); - if (flag) { - if(!strncmp(flag, "ipoib", strlen(flag))) - osm_prtn_add_mcgroup(conf->p_log, - conf->p_subn, conf->p_prtn); - else { + return 0; +} + + +static int partition_add_flag(unsigned lineno, struct part_conf *conf, + char *flag, char *val) +{ + int len = strlen(flag); + if (!strncmp(flag, "ipoib", len)) { + conf->is_ipoib = 1; + } + else if (!strncmp(flag, "mtu", len)) { + if (!val || (conf->mtu = strtoul(val, NULL, 0)) == 0) PARSEWARN(conf->p_log, lineno, - "unrecognized partition flag \'%s\'" - " - ignored.\n", flag); - } + "flag \'mtu\' requires valid value" + " - skipped.\n"); + } + else if (!strncmp(flag, "rate", len)) { + if (!val || (conf->rate = strtoul(val, NULL, 0)) == 0) + PARSEWARN(conf->p_log, lineno, + "flag \'rate\' requires valid value" + " - skipped.\n"); + } + else { + PARSEWARN(conf->p_log, lineno, + "unrecognized partition flag \'%s\'" + " - ignored.\n", flag); } - return 0; } @@ -285,7 +306,11 @@ static int parse_part_conf(struct part_c p += ret; len += ret; - if (q) { + while (q) { + flag = flval = NULL; + q = strchr(p, ','); + if (q) + *q++ = '\0'; ret = parse_name_token(p, &flag, &flval); if (!flag) { PARSERR(conf->p_log, lineno, @@ -294,6 +319,7 @@ static int parse_part_conf(struct part_c } p += ret; len += ret; + partition_add_flag(lineno, conf, flag, flval); } if 
(p != str || (partition_create(lineno, conf, From mst at mellanox.co.il Sun Feb 26 15:55:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 01:55:09 +0200 Subject: [openib-general] Re: [PATCH] mthca - command interface In-Reply-To: References: <1140432784.4429.14.camel@mtls03.yok.mtl.com> Message-ID: <20060226235509.GA22338@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca - command interface > > Thanks, I applied this to svn and queued it for 2.6.17, with a few changes: > > - Fix initialization of dev->cmd.flags to zero all flags > - Fix ioremapping of UAR0 to work on ppc64, etc (where HCA's bus > address view doesn't match token that ioremap needs) > - Change module parameter name to fw_cmd_doorbell > - My usual anal cleanups and style changes > > - R. Unfortunately, on some systems we see that command execution time is hurt by this option. It could be a firmware thing - I think it's best to disable fw_cmd_doorbell by default until we figure it out. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Sun Feb 26 15:57:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 01:57:50 +0200 Subject: [openib-general] Re: ipoib_multicast_ah.patch In-Reply-To: References: <20060222104037.GB21077@mellanox.co.il> Message-ID: <20060226235750.GA22412@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib_multicast_ah.patch > > This looks OK, I guess, but... > > Mst> ipoib_mcast_send tests mcast->ah twice. If this value > Mst> is changed between these two points, we leak an skb. > > who else might change the value here? Answering this would help > justify the patch and save me from having to write up a changelog entry. I had a quick look and couldn't see any other place - except when the mcast entry is freed. Am I missing something? -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Sun Feb 26 16:03:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 26 Feb 2006 16:03:00 -0800 Subject: [openib-general] Re: [PATCH] mthca - command interface In-Reply-To: <20060226235509.GA22338@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Feb 2006 01:55:09 +0200") References: <1140432784.4429.14.camel@mtls03.yok.mtl.com> <20060226235509.GA22338@mellanox.co.il> Message-ID: Michael> Unfortunately, on some systems we see that command Michael> execution time is hurt by this option. It could be a Michael> firmware thing - I think it's best to disable Michael> fw_cmd_doorbell by default until we figure it out. Which systems? Is it more likely that this helps or hurts? If the default is going to be off, is it worth having this option at all? - R. From rdreier at cisco.com Sun Feb 26 16:03:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 26 Feb 2006 16:03:58 -0800 Subject: [openib-general] Re: ipoib_multicast_ah.patch In-Reply-To: <20060226235750.GA22412@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Feb 2006 01:57:50 +0200") References: <20060222104037.GB21077@mellanox.co.il> <20060226235750.GA22412@mellanox.co.il> Message-ID: >> who else might change the value here? Answering this would >> help justify the patch and save me from having to write up a >> changelog entry. Michael> I had a quick look and couldn't see any other place - Michael> except when the mcast entry is freed. Am I missing Michael> something? I don't know if you're missing anything -- you posted the patch, and I'm just trying to understand the race it protects against. If nothing else could change the value in between the tests, then what's the harm in testing it twice? - R.
From rdreier at cisco.com Sun Feb 26 16:06:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 26 Feb 2006 16:06:09 -0800 Subject: [openib-general] [PATCH] mthca: implement query_ah for MADs and memfree -- REMINDER In-Reply-To: <200602261142.03037.jackm@mellanox.co.il> (Jack Morgenstein's message of "Sun, 26 Feb 2006 11:41:59 +0200") References: <200602261142.03037.jackm@mellanox.co.il> Message-ID: Thanks, applied and queued for 2.6.17 From rdreier at cisco.com Sun Feb 26 16:53:13 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 26 Feb 2006 16:53:13 -0800 Subject: [openib-general] create QP failure In-Reply-To: <43FC9F0D.9040304@scl.ameslab.gov> (Kyle Schochenmaier's message of "Wed, 22 Feb 2006 11:27:41 -0600") References: <43FC9F0D.9040304@scl.ameslab.gov> Message-ID: Hmm, I couldn't duplicate your problem here. The test program below just prints max_send_wr 65535, max_recv_wr 65535, max_send_sge 28, max_recv_sge 28 max_inline_data 0 max_send_wr 65535, max_recv_wr 65535, max_send_sge 28, max_recv_sge 28 max_inline_data 476 but creates both QPs successfully. Kyle, can you run the program below and tell me what it prints on your system? Michael -- I did see one problem on Arbel with MemFree FW 5.1.0. The FW reports a max descriptor size of 496 and max SG entries of 30. But 496 byte descriptors are not big enough for a send with 30 SG entries, so an attempt to create a QP with the max parameters fails with max_send_wr 16384, max_recv_wr 16384, max_send_sge 30, max_recv_sge 30 max_inline_data 0 Couldn't create QP #1 What do you think the best way to handle this is? Thanks, Roland here's the test program... it can be built with "gcc -o foo foo.c -libverbs" and doesn't take any options to run. /* * Copyright (c) 2006 Cisco Systems. All rights reserved. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License version * 2 as published by the Free Software Foundation. 
 *
 * $Id$
 */

#include <stdio.h>              /* printf(), fprintf() */
#include <infiniband/verbs.h>   /* ibv_* verbs API */

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_device *ib_dev;
	struct ibv_device_attr devattr;
	struct ibv_context *context;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp *qp;
	struct ibv_wc wc;

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list) {
		fprintf(stderr, "No IB devices found\n");
		return 1;
	}

	/* Use the first device in the list. */
	ib_dev = *dev_list;
	if (!ib_dev) {
		fprintf(stderr, "No IB devices found\n");
		return 1;
	}

	context = ibv_open_device(ib_dev);
	if (!context) {
		fprintf(stderr, "Couldn't get context for %s\n",
			ibv_get_device_name(ib_dev));
		return 1;
	}

	if (ibv_query_device(context, &devattr)) {
		fprintf(stderr, "Couldn't query device attrs\n");
		return 1;
	}

	pd = ibv_alloc_pd(context);
	if (!pd) {
		fprintf(stderr, "Couldn't allocate PD\n");
		return 1;
	}

	cq = ibv_create_cq(context, 1, NULL, NULL, 0);
	if (!cq) {
		fprintf(stderr, "Couldn't create CQ\n");
		return 1;
	}

	{
		/* Request the maximum limits the device reports. */
		struct ibv_qp_init_attr attr = {
			.send_cq = cq,
			.recv_cq = cq,
			.cap = {
				.max_send_wr = devattr.max_qp_wr,
				.max_recv_wr = devattr.max_qp_wr,
				.max_send_sge = devattr.max_sge,
				.max_recv_sge = devattr.max_sge
			},
			.qp_type = IBV_QPT_RC
		};

		printf("max_send_wr %d, max_recv_wr %d, max_send_sge %d, max_recv_sge %d max_inline_data %d\n",
		       attr.cap.max_send_wr, attr.cap.max_recv_wr,
		       attr.cap.max_send_sge, attr.cap.max_recv_sge,
		       attr.cap.max_inline_data);

		qp = ibv_create_qp(pd, &attr);
		if (!qp) {
			fprintf(stderr, "Couldn't create QP #1\n");
			return 1;
		}

		/* attr.cap now holds the values returned by the first create. */
		printf("max_send_wr %d, max_recv_wr %d, max_send_sge %d, max_recv_sge %d max_inline_data %d\n",
		       attr.cap.max_send_wr, attr.cap.max_recv_wr,
		       attr.cap.max_send_sge, attr.cap.max_recv_sge,
		       attr.cap.max_inline_data);

		qp = ibv_create_qp(pd, &attr);
		if (!qp) {
			fprintf(stderr, "Couldn't create QP #2\n");
			return 1;
		}
	}

	return 0;
}

From mst at mellanox.co.il Mon Feb 27 00:59:24 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 27 Feb 2006 10:59:24 +0200 Subject: [openib-general] Re: ipoib_multicast_ah.patch In-Reply-To: References: <20060222104037.GB21077@mellanox.co.il> <20060226235750.GA22412@mellanox.co.il> Message-ID: <20060227085924.GN19855@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib_multicast_ah.patch > > >> who else might change the value here? Answering this would > >> help justify the patch and save me from having to write up a > >> changelog entry. > > Michael> I had a quick look and couldn't see any other place - > Michael> except when the mcast entry is freed. Am I missing > Michael> something? > > I don't know if you're missing anything -- you posted the patch, and > I'm just trying to understand the race it protects against. If > nothing else could change the value in between the tests, then what's > the harm in testing it twice? I guess there's a misunderstanding here. It's pretty simple: ipoib_mcast_send tests mcast->ah twice under priv->lock. ipoib_mcast_join_finish modifies mcast->ah without taking a lock. No *other* place modifies mcast->ah. As a solution, take priv->lock around the assignment to mcast->ah, thus making sure ipoib_mcast_send is not in flight. OK? -- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Mon Feb 27 01:07:50 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 27 Feb 2006 11:07:50 +0200 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <20060222162507.GB24303@lst.de> References: <20060222162507.GB24303@lst.de> Message-ID: <4402C166.9010105@voltaire.com> Christoph Hellwig wrote: >> +/* iser_dto_add_regd_buff - increments the reference count for * >> + * the registered buffer & adds it to the DTO object */ >> +static void iser_dto_add_regd_buff(struct iser_dto *p_dto, >> + struct iser_regd_buf *p_regd_buf, >> + unsigned long use_offset, >> + unsigned long use_size) >> +{ >> + int add_idx; >> + >> + atomic_inc(&p_regd_buf->ref_count); > > Please kill the p_ prefix for pointer types all over the code. Done, accompanied by another cosmetic change: removing the iser_ prefix. Or. ------------------------------------------------------------------------ r5503 | ogerlitz | 2006-02-27 11:12:24 +0200 (Mon, 27 Feb 2006) | 4 lines removed the iser_ prefix from struct fields and variable names Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5502 | ogerlitz | 2006-02-27 11:09:38 +0200 (Mon, 27 Feb 2006) | 4 lines removed the p_ prefix from struct fields and variable names Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ From jackm at mellanox.co.il Mon Feb 27 01:14:52 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 27 Feb 2006 11:14:52 +0200 Subject: [openib-general] [PATCH] mthca: implement query_ah for MADs and memfree -- REMINDER In-Reply-To: References: <200602261142.03037.jackm@mellanox.co.il> Message-ID: <200602271114.59340.jackm@mellanox.co.il> On Monday 27 February 2006 02:06, Roland Dreier wrote: > Thanks, applied and queued for 2.6.17 The patch wasn't applied! (though SVN indicated it was).
I just updated my local copy, and only the $Id line was changed for the three files included in the patch. Huh? Anyhow, I'm re-sending the patch here: Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Implement query_ah in provider layer (except for av's which are in HCA memory) Needed for implementing RMPP duplicate session detection on sending side (extraction of DGID/DLID and GRH flag from address handle). Signed-off-by: Jack Morgenstein Index: src/drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-22 09:45:11.621141000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_av.c 2006-02-22 09:48:39.762130000 +0200 @@ -191,6 +191,34 @@ int mthca_read_ah(struct mthca_dev *dev, return 0; } +int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ah_attr *ah_attr) +{ + /* Only implement for MAD and memfree ah for now. */ + if (ah->type == MTHCA_AH_ON_HCA) + return -ENOSYS; + + memset(ah_attr, 0, sizeof *ah_attr); + ah_attr->dlid = be16_to_cpu(ah->av->dlid); + ah_attr->sl = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + ah_attr->static_rate = ah->av->msg_sr & 0x7; + ah_attr->src_path_bits = ah->av->g_slid & 0x7F; + ah_attr->port_num = be32_to_cpu(ah->av->port_pd) >> 24; + ah_attr->ah_flags = mthca_ah_grh_present(ah) ? 
IB_AH_GRH : 0; + + if (ah_attr->ah_flags) { + ah_attr->grh.traffic_class = + be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20; + ah_attr->grh.flow_label = + be32_to_cpu(ah->av->sl_tclass_flowlabel) & 0xfffff; + ah_attr->grh.hop_limit = ah->av->hop_limit; + ah_attr->grh.sgid_index = ah->av->gid_index & + (dev->limits.gid_table_len - 1); + memcpy(ah_attr->grh.dgid.raw, ah->av->dgid, 16); + } + return 0; +} + int __devinit mthca_init_av_table(struct mthca_dev *dev) { int err; Index: src/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-22 09:46:05.946574000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_dev.h 2006-02-22 09:48:39.781132000 +0200 @@ -532,6 +532,8 @@ int mthca_create_ah(struct mthca_dev *de struct mthca_pd *pd, struct ib_ah_attr *ah_attr, struct mthca_ah *ah); +int mthca_query_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ah_attr *ah_attr); int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header); Index: src/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- src.orig/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-22 09:46:06.020575000 +0200 +++ src/drivers/infiniband/hw/mthca/mthca_provider.c 2006-02-22 09:48:39.803132000 +0200 @@ -446,6 +446,11 @@ static int mthca_ah_destroy(struct ib_ah return 0; } +static int mthca_ah_query(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return mthca_query_ah(to_mdev(ah->device), to_mah(ah), ah_attr); +} + static struct ib_srq *mthca_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *init_attr, struct ib_udata *udata) @@ -1290,6 +1295,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.dealloc_pd = mthca_dealloc_pd; dev->ib_dev.create_ah = mthca_ah_create; dev->ib_dev.destroy_ah = mthca_ah_destroy; + 
dev->ib_dev.query_ah = mthca_ah_query; if (dev->mthca_flags & MTHCA_FLAG_SRQ) { dev->ib_dev.create_srq = mthca_create_srq; From jackm at mellanox.co.il Mon Feb 27 01:33:06 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 27 Feb 2006 11:33:06 +0200 Subject: [openib-general] [PATCH] mthca: implement query_ah for MADs and memfree -- REMINDER In-Reply-To: <200602271114.59340.jackm@mellanox.co.il> References: <200602261142.03037.jackm@mellanox.co.il> <200602271114.59340.jackm@mellanox.co.il> Message-ID: <200602271133.06954.jackm@mellanox.co.il> On Monday 27 February 2006 11:14, Jack Morgenstein wrote: > On Monday 27 February 2006 02:06, Roland Dreier wrote: > > Thanks, applied and queued for 2.6.17 > > The patch wasn't applied! (though SVN indicated it was). I just updated my > local copy, and only the $Id line was changed for the three files included > in the patch. > Huh? > Sorry about that -- just noticed that you changed the patch a bit, and put mthca_ah_query() into mthca_av.c directly. -- Jack From yael at mellanox.co.il Mon Feb 27 02:07:48 2006 From: yael at mellanox.co.il (Yael Kalka) Date: Mon, 27 Feb 2006 12:07:48 +0200 Subject: [openib-general] RE: [PATCH] opensm: fixes in signal handling Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDD5@mtlexch01.mtl.com> Hi Sasha, The patch compiles fine under windows. Thanks, Yael > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, February 22, 2006 1:33 AM > To: halr at voltaire.com; Yael Kalka > Cc: openib-general at openib.org > Subject: [PATCH] opensm: fixes in signal handling > > Hello, > > There are fixes for broken signal handling stuff. 
Finally this should > prevent some opensm crashes (actually deadlocks), for instance those caused > by a command such as: > > while [ 1 ] ; do kill -HUP ; done > > > Yael, I hope this patch does not break Windows compilation (this masks > signal related functions under __WIN__), but I cannot be absolutely sure - > please look at this from an "under Windows" point of view. Thanks. > > Sasha. > > > This fixes broken signal handling. In this patch: > > - signal handling stuff is moved to main.c > - cl_sig_* is replaced by the more powerful POSIX calls (I hope it should not > be bad for win because this is under !__WIN__ anyway) > - signal handler does not call resweeper or wakeup directly, but only > updates the new osm_hup_flag (or osm_exit_flag on SIGINT or SIGTERM) > - signal delivery is masked in all threads except the first one, so only > the expected thread will be interrupted (from sleep() or poll()) > - resweep thread will be woken up from the main.c thread instead of directly > - poll was added to osm_console - this provides timeout ability and > works around getline()'s signal interruption problem.
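The flag-based scheme described in the list above can be sketched in plain C roughly as follows. This is only a minimal illustration under stated assumptions, not the OpenSM code: the names `exit_flag`, `hup_flag`, `install_handlers` and `consume_hup` are hypothetical, the flags use `sig_atomic_t` where the patch uses `volatile unsigned int`, and the per-thread signal masking from the list is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Flags written only by the handlers and read by the main loop
 * (hypothetical names, not the ones used in the patch). */
static volatile sig_atomic_t exit_flag;
static volatile sig_atomic_t hup_flag;

/* The handlers do nothing except record that a signal arrived. */
static void mark_exit(int signum)
{
	(void)signum;
	exit_flag = 1;
}

static void mark_hup(int signum)
{
	(void)signum;
	hup_flag = 1;
}

/* Install both handlers with sigaction(); returns 0 on success. */
static int install_handlers(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof act);
	sigfillset(&act.sa_mask);	/* block other signals while handling */

	act.sa_handler = mark_exit;
	if (sigaction(SIGTERM, &act, NULL) != 0)
		return -1;

	act.sa_handler = mark_hup;
	if (sigaction(SIGHUP, &act, NULL) != 0)
		return -1;

	return 0;
}

/* Called from the main loop: returns 1 once per pending HUP and
 * clears the flag, so the real work happens outside the handler. */
static int consume_hup(void)
{
	if (!hup_flag)
		return 0;
	hup_flag = 0;
	return 1;
}
```

A main loop built on this would presumably run until `exit_flag` becomes nonzero and request a resweep whenever `consume_hup()` returns 1, so nothing that takes locks is ever called from signal context.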
> > > diff --git a/osm/include/opensm/osm_console.h > b/osm/include/opensm/osm_console.h > index 5a4036f..c5cd22a 100644 > --- a/osm/include/opensm/osm_console.h > +++ b/osm/include/opensm/osm_console.h > @@ -50,6 +50,7 @@ > BEGIN_C_DECLS > > void osm_console(osm_opensm_t *p_osm); > +void osm_console_prompt(void); > > END_C_DECLS > > diff --git a/osm/include/opensm/osm_opensm.h > b/osm/include/opensm/osm_opensm.h > index 833c4c3..3235ad4 100644 > --- a/osm/include/opensm/osm_opensm.h > +++ b/osm/include/opensm/osm_opensm.h > @@ -388,38 +388,12 @@ osm_opensm_wait_for_subnet_up( > > /****v* OpenSM/osm_exit_flag > */ > -extern volatile int osm_exit_flag; > +extern volatile unsigned int osm_exit_flag; > /* > * DESCRIPTION > * Set to one to cause all threads to leave > *********/ > > -#ifndef __WIN__ > -/****f* OpenSM: OpenSM/osm_reg_sig_handler > -* NAME > -* osm_reg_sig_handler > -* > -* DESCRIPTION > -* Registers the common signal handler > -* > -* SYNOPSIS > -*/ > -void osm_reg_sig_handler( > -IN osm_opensm_t* const p_osm); > -/* > -* PARAMETERS > -* p_osm > -* [in] Pointer to a OpenSM object to handle signals on. > -* > -* RETURN VALUES > -* None > -* > -* NOTES > -* > -* SEE ALSO > -*********/ > -#endif /* __WIN__ */ > - > END_C_DECLS > > #endif /* _OSM_OPENSM_H_ */ > diff --git a/osm/opensm/main.c b/osm/opensm/main.c > index c5ba443..797b14c 100644 > --- a/osm/opensm/main.c > +++ b/osm/opensm/main.c > @@ -77,14 +77,59 @@ > instantiating more than one opensm object. 
> */ > osm_opensm_t osm; > -volatile int osm_exit_flag = 0; > + > +volatile unsigned int osm_exit_flag = 0; > + > +static volatile unsigned int osm_hup_flag = 0; > > #define GUID_ARRAY_SIZE 64 > #define INVALID_GUID (0xFFFFFFFFFFFFFFFFULL) > > + > +#ifdef __WIN__ > +#define block_signals() > +#define setup_signals() > +#else > + > +static void mark_exit_flag(int signum) > +{ > + if(!osm_exit_flag) > + printf("OpenSM: Got signal %d - exiting...\n", signum); > + osm_exit_flag = 1; > +} > + > +static void mark_hup_flag(int signum) > +{ > + osm_hup_flag = 1; > +} > + > +static sigset_t saved_sigset; > + > +static void block_signals() > +{ > + sigset_t set; > + sigfillset(&set); > + sigprocmask(SIG_SETMASK, &set, &saved_sigset); > +} > + > +static void setup_signals() > +{ > + struct sigaction act; > + sigfillset(&act.sa_mask); > + act.sa_handler = mark_exit_flag; > + act.sa_flags = 0; > +#ifndef OSM_VENDOR_INTF_OPENIB > + sigaction(SIGINT, &act, NULL); > +#endif > + sigaction(SIGTERM, &act, NULL); > + act.sa_handler = mark_hup_flag; > + sigaction(SIGHUP, &act, NULL); > + sigprocmask(SIG_SETMASK, &saved_sigset, NULL); > +} > +#endif /* __WIN__ */ > + > /********************************************************************** > **********************************************************************/ > -void show_usage(void); > > void > show_usage(void) > @@ -247,7 +292,6 @@ show_usage(void) > > /********************************************************************** > **********************************************************************/ > -void show_menu(void); > > void > show_menu(void) > @@ -764,6 +808,8 @@ main( > if ( cache_options == TRUE ) > osm_subn_write_conf_file( &opt ); > > + block_signals(); > + > status = osm_opensm_init( &osm, &opt ); > if( status != IB_SUCCESS ) > { > @@ -794,9 +840,6 @@ main( > goto Exit; > } > > - /* this should handle ^C etc */ > - osm_reg_sig_handler( &osm ); > - > status = osm_opensm_bind( &osm, guid ); > if( status != IB_SUCCESS ) > { 
> @@ -817,6 +860,8 @@ main( > } > } > > + setup_signals(); > + > osm_opensm_sweep( &osm ); > /* since osm_opensm_init get opt as RO we'll set the opt value with UI > pfn here */ > /* Now do the registration */ > @@ -839,11 +884,23 @@ main( > In the future, some sort of console interactivity could > be implemented in this loop. > */ > - while( !osm_exit_flag ) > + if (opt.console) { > + printf("\nOpenSM Console\n\n"); > + osm_console_prompt(); > + } > + while( !osm_exit_flag ) { > if (opt.console) > osm_console(&osm); > else > cl_thread_suspend( 10000 ); > + > + if (osm_hup_flag) { > + osm_hup_flag = 0; > + /* a HUP signal should only start a new heavy sweep */ > + osm.subn.force_immediate_heavy_sweep = TRUE; > + cl_event_signal(&osm.sm.signal); > + } > + } > } > > #if 0 > diff --git a/osm/opensm/osm_console.c b/osm/opensm/osm_console.c > index c470b49..43d9f87 100644 > --- a/osm/opensm/osm_console.c > +++ b/osm/opensm/osm_console.c > @@ -39,6 +39,7 @@ > #define _GNU_SOURCE /* for getline */ > #include > #include > +#include > #include > > #define OSM_COMMAND_LINE_LEN 120 > @@ -186,15 +187,27 @@ static void parse_cmd_line(char *line, o > } > } > > +void osm_console_prompt(void) > +{ > + printf("%s", OSM_COMMAND_PROMPT); > + fflush(stdout); > +} > + > void osm_console(osm_opensm_t *p_osm) > { > + struct pollfd pollfd; > char *p_line; > size_t len; > ssize_t n; > > - printf("\nOpenSM Console\n\n"); > - while (1) { > - printf("%s", OSM_COMMAND_PROMPT); > + pollfd.fd = 0; > + pollfd.events = POLLIN; > + pollfd.revents = 0; > + > + if (poll(&pollfd, 1, 10000) <= 0) > + return; > + > + if (pollfd.revents|POLLIN) { > p_line = NULL; > /* Get input line */ > n = getline(&p_line, &len, stdin); > @@ -206,6 +219,7 @@ void osm_console(osm_opensm_t *p_osm) > printf("Input error\n"); > fflush(stdin); > } > + osm_console_prompt(); > } > } > > diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c > index 6ca6796..54d0ae3 100644 > --- a/osm/opensm/osm_opensm.c > +++ 
b/osm/opensm/osm_opensm.c > @@ -54,7 +54,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -149,52 +148,6 @@ osm_opensm_create_mcgroups( > } > > /********************************************************************** > - * SHUT DOWN IS CONTROLLED BY A GLOBAL EXIT FLAG > - **********************************************************************/ > -#ifndef __WIN__ > -static osm_opensm_t *__p_osm_to_signal; > - > -void > -__sig_handler( > - int signum ) > -{ > - static int got_signal = 0; > - > - if( signum != SIGHUP ) > - { > - if( !got_signal ) > - { > - got_signal++; > - printf( "OpenSM: Got signal %d - exiting...\n", signum ); > - osm_exit_flag = 1; > - } > - } > - else > - { > - /* a HUP signal should only start a new heavy sweep */ > - __p_osm_to_signal->subn.force_immediate_heavy_sweep = TRUE; > - osm_state_mgr_process( &__p_osm_to_signal->sm.state_mgr, > - OSM_SIGNAL_SWEEP ); > - } > -} > - > -void > -osm_reg_sig_handler( > - IN osm_opensm_t * const p_osm ) > -{ > - __p_osm_to_signal = p_osm; > -#ifndef OSM_VENDOR_INTF_OPENIB > - cl_reg_sig_hdl( SIGINT, __sig_handler ); > -#endif > - cl_reg_sig_hdl( SIGTERM, __sig_handler ); > - cl_reg_sig_hdl( SIGHUP, __sig_handler ); > - osm_exit_flag = 0; > - > - return; > -} > -#endif /* __WIN__ */ > - > -/********************************************************************** > **********************************************************************/ > ib_api_status_t > osm_opensm_init( > diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c > index 8ace290..e252861 100644 > --- a/osm/opensm/osm_sm.c > +++ b/osm/opensm/osm_sm.c > @@ -87,10 +87,6 @@ __osm_sm_sweeper( > > if( p_sm->thread_state == OSM_THREAD_STATE_INIT ) > { > - osm_log( p_sm->p_log, OSM_LOG_DEBUG, > - "__osm_sm_sweeper: " "Masking ^C Signals\n" ); > - cl_sig_mask_sigint( ); > - > p_sm->thread_state = OSM_THREAD_STATE_RUN; > } > > diff --git a/osm/opensm/osm_vl15intf.c b/osm/opensm/osm_vl15intf.c > index ef18e54..4796a17 
100644 > --- a/osm/opensm/osm_vl15intf.c > +++ b/osm/opensm/osm_vl15intf.c > @@ -85,7 +85,6 @@ __osm_vl15_poller( > > if ( p_vl->thread_state == OSM_THREAD_STATE_NONE) > { > - cl_sig_mask_sigint( ); > p_vl->thread_state = OSM_THREAD_STATE_RUN; > } > From yael at mellanox.co.il Mon Feb 27 02:14:17 2006 From: yael at mellanox.co.il (Yael Kalka) Date: 27 Feb 2006 12:14:17 +0200 Subject: [openib-general] [PATCH] OpenSM - osm_vendor_get_all_port_attr - add info Message-ID: <5zlkvxjeh2.fsf@mtl066.yok.mtl.com> Hi Hal, Currently osm_vendor_get_all_port_attr doesn't update the port number information. The following patch adds this information. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 5496) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -565,8 +565,10 @@ osm_vendor_get_all_port_attr( ib_net64_t *p_guid = portguids, *e = portguids + *p_num_ports; umad_ca_t ca; int lids[*p_num_ports]; + int portnums[*p_num_ports]; int linkstates[*p_num_ports]; int *p_lid = lids; + int *p_portnum = portnums; int *p_linkstates = linkstates; umad_port_t def_port = {""}; int r, i, j; @@ -622,6 +624,7 @@ osm_vendor_get_all_port_attr( portguids[0] = def_port.port_guid; lids[0] = def_port.base_lid; + portnums[0] = def_port.portnum; linkstates[0] = def_port.state; sm_lid = def_port.sm_lid; @@ -642,6 +645,7 @@ osm_vendor_get_all_port_attr( continue; p_attr_array[j].port_guid = portguids[i]; p_attr_array[j].lid = lids[i]; + p_attr_array[j].port_num = portnums[i]; if (j == 0) p_attr_array[j].sm_lid = sm_lid; else From sashak at voltaire.com Mon Feb 27 03:43:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Feb 2006 13:43:58 +0200 Subject: [openib-general] Re: [PATCH] opensm: fixes in signal handling In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDD5@mtlexch01.mtl.com> References: 
<6AB138A2AB8C8E4A98B9C0C3D52670E3F8FDD5@mtlexch01.mtl.com> Message-ID: <20060227114358.GA10242@sashak.voltaire.com> On 12:07 Mon 27 Feb , Yael Kalka wrote: > The patch compiles fine under windows. Fine. Many Thanks. Sasha. From ogerlitz at voltaire.com Mon Feb 27 05:21:20 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 27 Feb 2006 15:21:20 +0200 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <20060222162507.GB24303@lst.de> References: <20060222162507.GB24303@lst.de> Message-ID: <4402FCD0.4060806@voltaire.com> Christoph Hellwig wrote: >> +static int iser_post_receive_control(struct iscsi_iser_conn *p_iser_conn) ... >> + rx_desc = kmem_cache_alloc(ig.desc_cache, >> + GFP_KERNEL | __GFP_NOFAIL); > __GFP_NOFAIL doesn't work for slab (kmem_cache_alloc/kmalloc/kzalloc/kcalloc) > allocations DONE, in r5505 >> +send_data_out_error: >> + if (p_send_dto != NULL) >> + iser_dto_buffs_release(p_send_dto); >> + if (tx_desc != NULL) >> + kmem_cache_free(ig.desc_cache, tx_desc); > > could you please do the same goto-unwinding style we use elsewhere > in the kernel? That is one label before each unwind step and jump > directly to that instead of adding tons of conditionals in the error path. DONE, in r5507 ------------------------------------------------------------------------ r5507 | ogerlitz | 2006-02-27 15:27:37 +0200 (Mon, 27 Feb 2006) | 4 lines goto-unwinding style - use few goto labels without if()s between them Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ r5505 | ogerlitz | 2006-02-27 14:48:53 +0200 (Mon, 27 Feb 2006) | 6 lines don't use __GFP_NOFAIL flag for memory allocations,also one page_vec is pre-allocated per struct iser_conn and used for all fmr maps done on behalf of this connection, this is possible since tx is serialized. 
Signed-off-by: Or Gerlitz ------------------------------------------------------------------------ From ogerlitz at voltaire.com Mon Feb 27 05:26:29 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 27 Feb 2006 15:26:29 +0200 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <20060222162903.GC24303@lst.de> References: <20060222162903.GC24303@lst.de> Message-ID: <4402FE05.6020605@voltaire.com> Christoph Hellwig wrote: >> + if (cmd_dir == ISER_DIR_OUT) { >> + /* copy the unaligned sg the buffer which is used for RDMA */ >> + struct scatterlist *p_sg = (struct scatterlist *)p_mem->p_buf; >> + int i; >> + char *p; >> + >> + for (p = mem, i = 0; i < p_mem->size; i++) { >> + memcpy(p, >> + page_address(p_sg[i].page) + p_sg[i].offset, >> + p_sg[i].length); >> + p += p_sg[i].length; > > pages you get sent down in a sg list don't have to be kernel mapped, > you need to use kmap or kmap_atomic to access them. DONE in r5506, see the patch below use kmap_atomic instead of page_address in the code copying from/to SG which is unaligned for rdma Signed-off-by: Or Gerlitz Index: iser_memory.c =================================================================== --- iser_memory.c (revision 5505) +++ iser_memory.c (revision 5506) @@ -140,12 +140,14 @@ int iser_start_rdma_unaligned_sg(struct /* copy the unaligned sg the buffer which is used for RDMA */ struct scatterlist *sg = (struct scatterlist *)data->buf; int i; - char *p; + char *p, *from; for (p = mem, i = 0; i < data->size; i++) { + from = kmap_atomic(sg[i].page, KM_USER0); memcpy(p, - page_address(sg[i].page) + sg[i].offset, + from + sg[i].offset, sg[i].length); + kunmap_atomic(from, KM_USER0); p += sg[i].length; } } @@ -185,7 +187,7 @@ void iser_finalize_rdma_unaligned_sg(str if (ctask->dir[ISER_DIR_IN]) { char *mem; struct scatterlist *sg; - unsigned char *p; + unsigned char *p, *to; unsigned int sg_size; int i; @@ -200,9 +202,11 @@ void iser_finalize_rdma_unaligned_sg(str 
sg_size = ctask->data[ISER_DIR_IN].size; for (p = mem, i = 0; i < sg_size; i++){ - memcpy(page_address(sg[i].page) + sg[i].offset, + to = kmap_atomic(sg[i].page, KM_SOFTIRQ0); + memcpy(to + sg[i].offset, p, sg[i].length); + kunmap_atomic(to, KM_SOFTIRQ0); p += sg[i].length; } From jackm at mellanox.co.il Mon Feb 27 05:34:21 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 27 Feb 2006 15:34:21 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <43FF47E1.4090406@ichips.intel.com> References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> Message-ID: <200602271534.22167.jackm@mellanox.co.il> On Friday 24 February 2006 19:52, Sean Hefty wrote: > I was thinking about how we can reduce some of the inefficiencies. > Currently, we only track if a request is waiting for a response or not. We > can add a new state indicating that a response is in progress, which would > be set when the first segment of a response is received. This would be > used to suppress duplicate requests. There is still a race condition here. The duplicate request could go out while the first response segment is in transit (particularly true for requests which generate huge responses!). This fix is OK, but does not absolve the responding side from checking as well. > > On the receive side, I was considering adding an API that the user would > invoke to indicate that a response was being generated. The MAD layer > would queue this information, and a received request would be checked > against this queue to determine if it were a duplicate. When the response > is sent, the queued information would be removed. I think that we may be > able to use such an API to support dual-sided RMPP as well. There is still a race here -- between user indicating that a response will be generated, and a new request arriving. 
Not serious, though since presumably a duplicate request will only be issued after a significant timeout (seconds), and this API would be invoked immediately. This does demand changes in user code, which checking at the mad-send time does not. This is also more medium term, and will certainly not be in time for the upcoming release. I recommend that we use the mad duplicate RMPP send patch now (since it -- or something like it -- will still be needed when we do handling at the requester side, and at the receive side of the responder). This fix is admittedly incomplete, since it not as efficient as I would like (e.g., the duplicate request is still processed, and is thrown out only after all the processing is complete) -- but it does fix the problem. The Tid/GID issue is really a separate issue, and should be treated as such. I'm including the patch here, to save looking for it. Jack >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Prevent issuing multiple MAD transactions with the same TID. Could happen if duplicate requests are posted. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: latest/drivers/infiniband/core/mad.c =================================================================== --- latest.orig/drivers/infiniband/core/mad.c 2006-01-16 18:19:55.000000000 +0200 +++ latest/drivers/infiniband/core/mad.c 2006-01-19 11:41:42.000000000 +0200 @@ -907,6 +907,20 @@ return ret; } +static inline int is_rmpp_data(struct ib_mad *mad) +{ + /* check if has rmpp header */ + if (mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_ADM && + (mad->mad_hdr.mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START || + mad->mad_hdr.mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + + return ((ib_get_rmpp_flags(&((struct ib_rmpp_mad *)mad)->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE) && + ((struct ib_rmpp_mad *)mad)->rmpp_hdr.rmpp_type == + IB_MGMT_RMPP_TYPE_DATA); +} + /* * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client @@ -964,6 +979,13 @@ /* Reference MAD agent until send completes */ atomic_inc(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (is_rmpp_data(send_buf->mad) && + ib_find_send_mad(mad_agent_priv, mad_send_wr->tid)) { + /* Duplicate send request */ + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + return -EBUSY; + } list_add_tail(&mad_send_wr->agent_list, &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); From mst at mellanox.co.il Mon Feb 27 06:01:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 16:01:13 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <200602271534.22167.jackm@mellanox.co.il> References: <200602271534.22167.jackm@mellanox.co.il> Message-ID: <20060227140113.GS19855@mellanox.co.il> Quoting Jack Morgenstein : > Prevent issuing multiple MAD transactions with the same TID. > Could happen if duplicate requests are posted. 
> > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. Tsirkin This is actually affecting duplicate responses as well, isn't it? As far as I remember, duplicate RMPP responses corrupting each other is something that we saw in the lab with opensm, so it would be nice to have that fixed for 2.6.17 (or even for 2.6.16 if possible?). I think the problem with this specific patch was that ib_find_send_mad ignores the remote address (and method?) when doing lookup, so its wrong to use it to check that the transaction is in progress. This shouldnt be hard to fix, though, should it? We should just check the packet method, and compare source/destination GIDs as appropriate. The remote GID can be got from ib_query_ah. Right? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 27 06:10:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 16:10:48 +0200 Subject: [openib-general] [PATCH] mthca: fence support Message-ID: <20060227141048.GU19855@mellanox.co.il> Add support to IB_SEND_FENCE in post_send. Signed-off-by: Dotan Barak Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-02-26 13:10:59.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_qp.c 2006-02-26 13:13:02.000000000 +0200 @@ -1602,7 +1602,9 @@ int mthca_tavor_post_send(struct ib_qp * mthca_opcode[wr->opcode]); wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = - cpu_to_be32((size0 ? 0 : MTHCA_NEXT_DBD) | size); + cpu_to_be32((size0 ? 0 : MTHCA_NEXT_DBD) | size | + ((wr->send_flags & IB_SEND_FENCE) ? 
+ MTHCA_NEXT_FENCE : 0)); if (!size0) { size0 = size; @@ -1964,7 +1966,9 @@ int mthca_arbel_post_send(struct ib_qp * mthca_opcode[wr->opcode]); wmb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = - cpu_to_be32(MTHCA_NEXT_DBD | size); + cpu_to_be32(MTHCA_NEXT_DBD | size | + ((wr->send_flags & IB_SEND_FENCE) ? + MTHCA_NEXT_FENCE : 0)); if (!size0) { size0 = size; -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 27 06:12:26 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 16:12:26 +0200 Subject: [openib-general] [PATCH] libmthca: fence support Message-ID: <20060227141226.GV19855@mellanox.co.il> Add support to IBV_SEND_FENCE in post_send. Signed-off-by: Dotan Barak Signed-off-by: Michael S. Tsirkin Index: last_stable/src/userspace/libmthca/src/qp.c =================================================================== --- last_stable.orig/src/userspace/libmthca/src/qp.c 2006-02-26 13:10:42.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/qp.c 2006-02-26 13:11:54.000000000 +0200 @@ -282,7 +282,9 @@ int mthca_tavor_post_send(struct ibv_qp mthca_opcode[wr->opcode]); ((struct mthca_next_seg *) prev_wqe)->ee_nds = - htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size); + htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size | + ((wr->send_flags & IBV_SEND_FENCE) ? + MTHCA_NEXT_FENCE : 0)); if (!size0) { size0 = size; @@ -633,7 +634,9 @@ int mthca_arbel_post_send(struct ibv_qp mthca_opcode[wr->opcode]); mb(); ((struct mthca_next_seg *) prev_wqe)->ee_nds = - htonl(MTHCA_NEXT_DBD | size); + htonl(MTHCA_NEXT_DBD | size | + ((wr->send_flags & IBV_SEND_FENCE) ? + MTHCA_NEXT_FENCE : 0)); if (!size0) { size0 = size; -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies

From jackm at mellanox.co.il Mon Feb 27 06:16:58 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Mon, 27 Feb 2006 16:16:58 +0200
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <20060227140113.GS19855@mellanox.co.il>
References: <200602271534.22167.jackm@mellanox.co.il> <20060227140113.GS19855@mellanox.co.il>
Message-ID: <200602271616.58922.jackm@mellanox.co.il>

On Monday 27 February 2006 16:01, Michael S. Tsirkin wrote:
> I think the problem with this specific patch was that ib_find_send_mad
> ignores the remote address (and method?) when doing lookup,
> so its wrong to use it to check that the transaction is in progress.
> This shouldnt be hard to fix, though, should it?

ib_find_send_mad is also used on the QP receive side for matching arriving packets (RMPP ACK/NACK) to waiting sessions (the RMPP responder session). Using ib_find_send_mad on the send side is no worse than using it on the response side.

The fix I submitted checks for RMPP sessions ONLY. If there is a duplicate session problem (same TID, from different GIDs), that problem will still exist even with the submitted patch when ACK/NACKs arrive -- since matching of incoming packets to MAD sends waiting for responses is currently done by TID alone. That needs a separate fix.

I.e., there are 2 separate problems:
1. RMPP duplicate sessions with the same host -- due to dropped segments and requester retries.
2. Separate RMPP sessions, where TIDs are identical, but GIDs are different.

The included patch fixes the first.

> > We should just check the packet method, and compare source/destination GIDs
> as appropriate. The remote GID can be got from ib_query_ah. Right?
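
The difference between the two problems in the message above can be illustrated with a small standalone C sketch. It is only an illustration under stated assumptions: struct pending_send, find_by_tid and find_by_tid_gid are hypothetical names made up for this example, and they are not the kernel's ib_find_send_mad or its data structures. The first lookup matches on TID alone, as the thread says the current matching does; the second also compares the remote GID, which is roughly the direction suggested for telling apart sessions that share a TID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16

/* Hypothetical, simplified stand-in for an outstanding MAD send.
 * This is NOT the kernel's private send work request structure. */
struct pending_send {
	uint64_t tid;                 /* transaction ID */
	uint8_t  remote_gid[GID_LEN]; /* GID of the peer for this send */
};

/* Match on TID alone: two sessions with the same TID but different
 * peers cannot be told apart, the first list entry always wins. */
static const struct pending_send *
find_by_tid(const struct pending_send *list, size_t n, uint64_t tid)
{
	size_t i;

	assert(list != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (list[i].tid == tid)
			return &list[i];
	return NULL;
}

/* Match on TID and remote GID (an assumed refinement, for illustration):
 * sessions with equal TIDs but different peers are kept separate. */
static const struct pending_send *
find_by_tid_gid(const struct pending_send *list, size_t n,
		uint64_t tid, const uint8_t gid[GID_LEN])
{
	size_t i;

	assert(list != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (list[i].tid == tid &&
		    memcmp(list[i].remote_gid, gid, GID_LEN) == 0)
			return &list[i];
	return NULL;
}
```

With two outstanding sends that both use TID 3 but go to different peers, find_by_tid returns the first entry no matter which peer an incoming ACK came from, while find_by_tid_gid returns the entry for that peer (or NULL for an unknown peer). Neither lookup addresses the first problem, duplicate sessions with the same host, which is what the submitted patch targets.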
From rep.nop at aon.at Mon Feb 27 07:40:27 2006 From: rep.nop at aon.at (Bernhard Fischer) Date: Mon, 27 Feb 2006 16:40:27 +0100 Subject: [openib-general] [patch] tvflash configure checks for libpci Message-ID: <20060227154027.GB7849@aon.at> Hi, tvflash does check if pci/pci.h is installed but does not error out if it is not found. This leads to build-errors later on as the include is not found. Attached patch would - check for pci/pci.h and bail out if it is not found. - check for needed functions provided by libpci Signed-off-by: Bernhard Fischer Please consider applying something to this effect. Thank you -------------- next part -------------- Index: configure.in =================================================================== --- configure.in (revision 5203) +++ configure.in (working copy) @@ -11,16 +11,25 @@ # Checks for programs. AC_PROG_CC -# Checks for libraries. -AC_CHECK_LIB(pci, pci_alloc) - # Checks for header files. AC_HEADER_STDC -AC_CHECK_HEADERS(fcntl.h limits.h stdlib.h string.h unistd.h pci/pci.h) +AC_CHECK_HEADERS(fcntl.h limits.h stdlib.h string.h unistd.h) +AC_CHECK_HEADER([pci/pci.h], [], + AC_MSG_ERROR([ not found.])) + # Checks for typedefs, structures, and compiler characteristics. # Checks for library functions. AC_FUNC_MALLOC AC_CHECK_FUNCS(memset strchr strtoul) +AC_CHECK_LIB(pci, pci_init, [], + AC_MSG_ERROR([libpci not found. +])) +AC_CHECK_FUNCS(pci_init pci_scan_bus pci_alloc \ + pci_cleanup pci_fill_info pci_write_long pci_read_long \ + pci_read_word, [], + AC_MSG_ERROR([libpci functions not found. +])) + AC_OUTPUT(Makefile tvflash.spec) From parks at lanl.gov Mon Feb 27 08:54:41 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 27 Feb 2006 09:54:41 -0700 Subject: [openib-general] Open Fabrics Symposium Message-ID: <6.2.3.4.2.20060227095302.01f6c050@ccn-mail.lanl.gov> HI when booking for Open Fabrics Symposium, and one day at IDF is it 250 + the 295 or does the 295 cover both ?? 
thanks
parks

From mst at mellanox.co.il Mon Feb 27 09:34:59 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 27 Feb 2006 19:34:59 +0200
Subject: [openib-general] RFC: SDP plans
Message-ID: <20060227173459.GB19855@mellanox.co.il>

I started preparing a stable linux SDP implementation,
with the eye towards mainline inclusion.

The idea is to get to a drastically simple code base and get this admitted in
mainline, then add enhancements.

The plan (as compared to existing SDP implementation) includes:
- Use CMA API
- Reuse generic code from sock.c
  SO_SNDBUF/SO_RCVBUF should work properly
- Use sock_lock_t and simple spin_lock_bh for socket locking
- Use skbuff and standard skbuff queues (in struct sock)
  for incoming/outgoing messages
- Implement transport-level queues by simple circular buffer,
  attach BSDH by s/g
- Set socket bits to signal the need for control messages
- Single CQ, perform all CQ polling from interrupt context
- Code must be sparse-clean, keep network data in __beXX structures
- Proper use of DMA API
- Use sysfs for statistics, entry per socket
- Only support the BCopy mode for starters -
  any advertisements will be answered with sendsm
- Only support synchronous operations for starters: applications
  can use userspace AIO emulation, and hopefully generic AIO support
  in kernel will mature meanwhile.

-- Michael S.
Tsirkin Staff Engineer, Mellanox Technologies

From iod00d at hp.com Mon Feb 27 09:42:37 2006
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 27 Feb 2006 09:42:37 -0800
Subject: [openib-general] Performance optimization
In-Reply-To: <43FA29A1.5020301@uci.edu>
References: <43FA29A1.5020301@uci.edu>
Message-ID: <20060227174237.GH31654@esmail.cup.hp.com>

On Mon, Feb 20, 2006 at 12:42:09PM -0800, Frithjof Kruggel wrote:
> However, the performance as measured by the read/
> write tests is "only" about 200 MB/s - I expected
> figures in the order of 400-500 MB/s. Can you give
> me some step-by-step list how to optimize performance?

Are the Opteron nodes single CPU or SMP? If SMP, you will need to look at binding processes to specific CPUs for optimal performance. Read "man taskset" (apt-get install schedutils).

Which tests are you running? (Michael Tsirkin already asked but I didn't see the answer.)

grant

From bill.boas at gmail.com Mon Feb 27 09:44:05 2006
From: bill.boas at gmail.com (Bill Boas)
Date: Mon, 27 Feb 2006 09:44:05 -0800
Subject: [openib-general] RFC: SDP plans
In-Reply-To: <20060227173459.GB19855@mellanox.co.il>
References: <20060227173459.GB19855@mellanox.co.il>
Message-ID: <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com>

Michael,

Having a stable, performant SDP in Rel. 1.0 available from Novell and RedHat is absolutely critical for Wall Street and others.

If anyone disagrees with this requirement please speak up!

Bill.

On 2/27/06, Michael S. Tsirkin wrote:
>
> I started preparing a stable linux SDP implementation,
> with the eye towards mainline inclusion.
>
> The idea is to get to a drastically simple code base and get this admitted in
> mainline, then add enhancements.
> > The plan (as compared to existing SDP implementation) includes: > - Use CMA API > - Reuse generic code from sock.c > SO_SNDBUF/SO_RCVBUF should work properly > - Use sock_lock_t and simple spin_lock_bh for socket locking > - Use skbuff and standard skbuff queues (in struct sock) > for incoming/outgoing messages > - Implement transport-level queues by simple circular buffer, > attach BSDH by s/g > - Set socket bits to signal the need for control messages > - Single CQ, perform all CQ polling from interrupt context > - Code must be sparse-clean, keep network data in __beXX structures > - Proper use of DMA API > - Use sysfs for statistics, entry per socket > - Only support the BCopy mode for starters - > any advertisements will be answered with sendsm > - Only support synchronous operations for starters: applications > can use userspace AIO emulation, and hopefully generic AIO support > in kernel will mature meanwhile. > > -- > Michael S. Tsirkin > Staff Engineer, Mellanox Technologies > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Feb 27 09:59:06 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 09:59:06 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: References: <1140801374.1158.22.camel@camp4.serpentine.com> Message-ID: <44033DEA.8060105@ichips.intel.com> Roland Dreier wrote: > I've said this before, but I think the simplest way to handle kernel > components is to simply say we support vanilla upstream kernels. If > your favorite feature isn't upstream, then this should motivate you to > get it upstream. I agree. 
If code isn't good enough to merge upstream, then we should treat it as not being ready for release. It doesn't make sense to support it through a separate release. - Sean From mshefty at ichips.intel.com Mon Feb 27 10:00:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 10:00:55 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060225170316.GA15973@mellanox.co.il> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> Message-ID: <44033E57.8090905@ichips.intel.com> Michael S. Tsirkin wrote: > I think we can use the method field for this. > C15-0.1.18: > > ... > . The method used for all packets sent from the Receiver to the Sender > shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending > on which initiated the transfer. > > . The method used for all packets sent from the Sender to the Receiver > shall be SubnAdmGetTableResp(). I'd like to resist putting protocol specific knowledge into the MAD layer as much as possible. Such a fix is unlikely to apply in more generic cases. I believe that this is an architectural issue that would be best addressed at that level. - Sean From mst at mellanox.co.il Mon Feb 27 10:11:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 20:11:56 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <44033E57.8090905@ichips.intel.com> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> Message-ID: <20060227181156.GC17414@mellanox.co.il> Quoting r. Sean Hefty : > >I think we can use the method field for this. > >C15-0.1.18: > > > >... > >. 
The method used for all packets sent from the Receiver to the Sender > >shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending > >on which initiated the transfer. > > > >. The method used for all packets sent from the Sender to the Receiver > >shall be SubnAdmGetTableResp(). > > I'd like to resist putting protocol specific knowledge into the MAD layer > as much as possible. Such a fix is unlikely to apply in more generic > cases. I believe that this is an architectural issue that would be best > addressed at that level. I dont see a way around this: the dirt is at the spec level. You just said: I don't believe that there's any way to know if an abort is for an RMPP message that is being sent, versus received. Did you change your mind then? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Feb 27 10:16:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 10:16:03 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <200602271534.22167.jackm@mellanox.co.il> References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> <200602271534.22167.jackm@mellanox.co.il> Message-ID: <440341E3.9010502@ichips.intel.com> Jack Morgenstein wrote: >>I was thinking about how we can reduce some of the inefficiencies. >>Currently, we only track if a request is waiting for a response or not. We >>can add a new state indicating that a response is in progress, which would >>be set when the first segment of a response is received. This would be >>used to suppress duplicate requests. > > There is still a race condition here. The duplicate request could go out > while the first response segment is in transit (particularly true for > requests which generate huge responses!). > > This fix is OK, but does not absolve the responding side from checking as > well. 
Yes - I'm aware that there's still a race here, but I don't see a way around it on the send side. The goal here is to reduce some of the inefficiencies. Consider the existing ib_sa.h interface. A request is not automatically retried by the MAD layer, so will time out after 1 attempt. The RMPP response will be reassembled, then tossed. The user may retry the request, but will receive another TID when doing so. >>On the receive side, I was considering adding an API that the user would >>invoke to indicate that a response was being generated. The MAD layer >>would queue this information, and a received request would be checked >>against this queue to determine if it were a duplicate. When the response >>is sent, the queued information would be removed. I think that we may be >>able to use such an API to support dual-sided RMPP as well. > > There is still a race here -- between user indicating that a response will be > generated, and a new request arriving. Not serious, though since presumably > a duplicate request will only be issued after a significant timeout (seconds), > and this API would be invoked immediately. > This does demand changes in user code, which checking at the mad-send > time does not. Unless the ULP does all the checking, there will always be a race. What this does do is decrease the size of the window for detecting duplicate requests. We want to detect duplicate requests sooner to avoid as much processing as possible. This also leaves control of request-response handling to the ULP. > I recommend that we use the mad duplicate RMPP send patch now (since it -- or > something like it -- will still be needed when we do handling at the > requester side, and at the receive side of the responder). This fix is > admittedly incomplete, since it not as efficient as I would like (e.g., the > duplicate request is still processed, and is thrown out only after all the > processing is complete) -- but it does fix the problem. 
The only problem that these patches seem to address are inefficiencies processing a duplicate request and sending some extra MADs. Is there a more severe problem that you can point to? It also seems that we're unlikely to hit any of these problems with the current ib_sa interface. - Sean From mst at mellanox.co.il Mon Feb 27 10:18:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 20:18:47 +0200 Subject: [openib-general] RFC: SDP plans In-Reply-To: <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> Message-ID: <20060227181847.GA19265@mellanox.co.il> Quoting Bill Boas : > Michael, > > Having a stable, performant SDP in Rel. 1.0 available from Novel and RedHat is > absolutely critical for Wall Street and others. > > If anyone diagrees with this requirement please speak up! > > Bill. I think I agree with Roland and Sean that its best to avoid distributing kernel level code in Release 1.0. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Feb 27 10:17:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 10:17:19 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060227181156.GC17414@mellanox.co.il> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> Message-ID: <4403422F.2000408@ichips.intel.com> Michael S. Tsirkin wrote: > I dont see a way around this: the dirt is at the spec level. Correct - I believe this is an architectural issue that needs to be addressed by changes to the spec. - Sean From mst at mellanox.co.il Mon Feb 27 10:27:46 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 27 Feb 2006 20:27:46 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <4403422F.2000408@ichips.intel.com> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> Message-ID: <20060227182746.GB19265@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID > > >I dont see a way around this: the dirt is at the spec level. > > Correct - I believe this is an architectural issue that needs to be > addressed by changes to the spec. Okay. But, note this affects ACKs as well as ABORTs (with multi-packet requests): Host A sends an RMPP request message to host B with TID=3 Host B sends an RMPP request message to host A with TID=3. Now if A generates an RMPP response it has TID=3. If B sends ACK, host A has no idea which transaction is being ACKed. What kind of spec change do you envision? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Feb 27 10:25:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 10:25:39 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> Message-ID: <44034423.8090109@ichips.intel.com> Bill Boas wrote: > Having a stable, performant SDP in Rel. 1.0 available from Novel and > RedHat is absolutely critical for Wall Street and others. Then release 1.0 will need to wait until SPD has been updated. - Sean From mst at mellanox.co.il Mon Feb 27 10:39:05 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 27 Feb 2006 20:39:05 +0200 Subject: [openib-general] Re: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <440341E3.9010502@ichips.intel.com> References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> <200602271534.22167.jackm@mellanox.co.il> <440341E3.9010502@ichips.intel.com> Message-ID: <20060227183905.GC19265@mellanox.co.il> Quoting r. Sean Hefty : > The only problem that these patches seem to address are inefficiencies > processing a duplicate request and sending some extra MADs. Is there a > more severe problem that you can point to? Note that once you have multiple outstanding transactions with the same TID/GID/method, the specific RMPP transaction will be sure to fail since ACKs will get matched to the wrong transaction. We are actually seeing these failures on big clusters when a diagnostic tool tries to get a list of all nodes from opensm: it seems to often retry the request MAD while the SA node gets round to responding to the first one. > It also seems that we're unlikely to hit any of these problems with the > current ib_sa interface. > > - Sean Possibly - we are seeing these when posting things all node port info request on top of umad. The problem appears on the opensm side. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 27 10:42:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 20:42:12 +0200 Subject: [openib-general] RFC: SDP plans In-Reply-To: <44034423.8090109@ichips.intel.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> Message-ID: <20060227184212.GD19265@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] RFC: SDP plans > > Bill Boas wrote: > >Having a stable, performant SDP in Rel. 
1.0 available from Novel and > >RedHat is absolutely critical for Wall Street and others. > > Then release 1.0 will need to wait until SPD has been updated. I'm not sure what the point would be: is that in case SDP misses 2.6.17? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Mon Feb 27 10:50:37 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 10:50:37 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <20060227184212.GD19265@mellanox.co.il> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> Message-ID: <440349FD.1050103@ichips.intel.com> Michael S. Tsirkin wrote: >>>Having a stable, performant SDP in Rel. 1.0 available from Novel and >>>RedHat is absolutely critical for Wall Street and others. >> >>Then release 1.0 will need to wait until SPD has been updated. > > I'm not sure what the point would be: is that in case SDP misses 2.6.17? My understanding is that SDP is not ready for merging upstream. If having a stable release of SDP is a requirement for release 1.0, then the release should wait until you've completed your updates, and it is ready for merging upstream. - Sean From mshefty at ichips.intel.com Mon Feb 27 11:02:55 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 11:02:55 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060227182746.GB19265@mellanox.co.il> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> Message-ID: <44034CDF.9030204@ichips.intel.com> Michael S. 
Tsirkin wrote: > Okay. But, note this affects ACKs as well as ABORTs (with multi-packet > requests): > > Host A sends an RMPP request message to host B with TID=3 > Host B sends an RMPP request message to host A with TID=3. > Now if A generates an RMPP response it has TID=3. > > If B sends ACK, host A has no idea which transaction is being ACKed. Bah... can we distinguish which transaction is being ACKed by the response bit? There's a slight difference between the ACK and abort. ACKs always match with sends, whereas an abort can match with a send or receive. > What kind of spec change do you envision? Maybe having the abort carry whether a send or receive is being aborted? We may need to start with identifying which stop/abort codes can match with sends and receives, then creating two codes. - Sean From mshefty at ichips.intel.com Mon Feb 27 11:12:54 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 11:12:54 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060227183905.GC19265@mellanox.co.il> References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> <200602271534.22167.jackm@mellanox.co.il> <440341E3.9010502@ichips.intel.com> <20060227183905.GC19265@mellanox.co.il> Message-ID: <44034F36.2000402@ichips.intel.com> Michael S. Tsirkin wrote: > Note that once you have multiple outstanding transactions with the same > TID/GID/method, the specific RMPP transaction will be sure to fail > since ACKs will get matched to the wrong transaction. As long as all of the ACKs match to the same RMPP response transaction, this should be okay. Some of the ACKs will be interpreted as old/duplicates and be discarded. The first response should be reassembled on the requester side. Additional responses may time out waiting for an ACK that gets matched to another request, but that shouldn't matter. 
If this is not the case, I'd like to understand why this isn't happening. There may be a more serious issue that we're overlooking. - Sean From betsy at pathscale.com Mon Feb 27 11:25:02 2006 From: betsy at pathscale.com (Betsy Zeller) Date: Mon, 27 Feb 2006 11:25:02 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <440349FD.1050103@ichips.intel.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> Message-ID: <1141068303.8307.175.camel@sarium.internal.keyresearch.com> Adding Bryan O'Sullivan (OpenIB 1.0 release manager) to cc list. -Betsy On Mon, 2006-02-27 at 10:50 -0800, Sean Hefty wrote: > Michael S. Tsirkin wrote: > >>>Having a stable, performant SDP in Rel. 1.0 available from Novel and > >>>RedHat is absolutely critical for Wall Street and others. > >> > >>Then release 1.0 will need to wait until SPD has been updated. > > > > I'm not sure what the point would be: is that in case SDP misses 2.6.17? > > My understanding is that SDP is not ready for merging upstream. If having a > stable release of SDP is a requirement for release 1.0, then the release should > wait until you've completed your updates, and it is ready for merging upstream. > > - Sean -- Betsy Zeller Director of Software Engineering PathScale, Inc 2071 Stierlin Court, Suite 200 1-650-934-8088 From mst at mellanox.co.il Mon Feb 27 11:33:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 21:33:11 +0200 Subject: [openib-general] RFC: SDP plans In-Reply-To: <440349FD.1050103@ichips.intel.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> Message-ID: <20060227193311.GA20064@mellanox.co.il> Quoting r. 
Sean Hefty : > Subject: Re: [openib-general] RFC: SDP plans > > Michael S. Tsirkin wrote: > >>>Having a stable, performant SDP in Rel. 1.0 available from Novell and > >>>RedHat is absolutely critical for Wall Street and others. > >> > >>Then release 1.0 will need to wait until SDP has been updated. > > > >I'm not sure what the point would be: is that in case SDP misses 2.6.17? > > My understanding is that SDP is not ready for merging upstream. If having > a stable release of SDP is a requirement for release 1.0, then the release > should wait until you've completed your updates, and it is ready for > merging upstream. Right, but SDP kernel/user interface is well defined, besides the protocol family. It's basically the regular socket interface + unbind option. So it's possible to make the protocol family configurable in userspace sdp library (libsdp) and then include libsdp in Release 1.0 and it will work with whatever SDP code makes it upstream. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Mon Feb 27 11:36:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 21:36:37 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID In-Reply-To: <44034F36.2000402@ichips.intel.com> References: <1140798219.4336.4.camel@hal.voltaire.com> <43FF47E1.4090406@ichips.intel.com> <200602271534.22167.jackm@mellanox.co.il> <440341E3.9010502@ichips.intel.com> <20060227183905.GC19265@mellanox.co.il> <44034F36.2000402@ichips.intel.com> Message-ID: <20060227193637.GB20064@mellanox.co.il> Quoting Sean Hefty : > As long as all of the ACKs match to the same RMPP response transaction, > this should be okay. Some of the ACKs will be interpreted as > old/duplicates and be discarded. The first response should be reassembled > on the requester side. Additional responses may time out waiting for an ACK > that gets matched to another request, but that shouldn't matter. 
> > If this is not the case, I'd like to understand why this isn't happening. > There may be a more serious issue that we're overlooking. What you say makes sense, but I think that's not the case. I'll let Jack speak about this tomorrow - it was him that debugged this. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sashak at voltaire.com Mon Feb 27 11:37:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Feb 2006 21:37:32 +0200 Subject: [openib-general] [PATCH] osm/complib: remove unused files Message-ID: <20060227193732.GB10270@sashak.voltaire.com> Hello, This removes some unused files from complib. Sasha. Remove some unused files from complib Signed-off-by: Sasha Khapyorsky --- osm/complib/Makefile.am | 11 - osm/complib/cl_complib.c | 18 - osm/complib/cl_device.c | 145 ------- osm/complib/cl_syshelper.c | 72 ---- osm/complib/cl_waitobj.c | 215 ----------- osm/include/Makefile.am | 3 osm/include/complib/cl_device.h | 712 ------------------------------------ osm/include/complib/cl_syshelper.h | 238 ------------ osm/include/complib/cl_waitobj.h | 369 ------------------- 9 files changed, 4 insertions(+), 1779 deletions(-) diff --git a/osm/complib/Makefile.am b/osm/complib/Makefile.am index 9dbf7a1..c92bffe 100644 --- a/osm/complib/Makefile.am +++ b/osm/complib/Makefile.am @@ -17,14 +17,14 @@ else libosmcomp_version_script = endif -libosmcomp_la_SOURCES = cl_async_proc.c cl_complib.c cl_device.c \ +libosmcomp_la_SOURCES = cl_async_proc.c cl_complib.c \ cl_dispatcher.c cl_event.c cl_event_wheel.c \ cl_list.c cl_log.c cl_map.c cl_memory.c \ cl_memory_osd.c cl_obj.c cl_perf.c cl_pool.c \ cl_ptr_vector.c cl_reqmgr.c \ cl_spinlock.c cl_statustext.c \ - cl_syshelper.c cl_thread.c cl_threadpool.c \ - cl_timer.c cl_vector.c cl_waitobj.c \ + cl_thread.c cl_threadpool.c \ + cl_timer.c cl_vector.c \ ib_statustext.c libosmcomp_la_LDFLAGS = -version-info $(complib_api_version) \ -export-dynamic $(libosmcomp_version_script) @@ -40,7 +40,6 @@ 
libosmcompinclude_HEADERS = $(srcdir)/.. $(srcdir)/../include/complib/cl_comppool.h \ $(srcdir)/../include/complib/cl_debug.h \ $(srcdir)/../include/complib/cl_debug_osd.h \ - $(srcdir)/../include/complib/cl_device.h \ $(srcdir)/../include/complib/cl_dispatcher.h \ $(srcdir)/../include/complib/cl_event.h \ $(srcdir)/../include/complib/cl_event_wheel.h \ @@ -68,7 +67,6 @@ libosmcompinclude_HEADERS = $(srcdir)/.. $(srcdir)/../include/complib/cl_reqmgr.h \ $(srcdir)/../include/complib/cl_spinlock.h \ $(srcdir)/../include/complib/cl_spinlock_osd.h \ - $(srcdir)/../include/complib/cl_syshelper.h \ $(srcdir)/../include/complib/cl_thread.h \ $(srcdir)/../include/complib/cl_thread_osd.h \ $(srcdir)/../include/complib/cl_threadpool.h \ @@ -81,8 +79,7 @@ libosmcompinclude_HEADERS = $(srcdir)/.. $(srcdir)/../include/complib/cl_timer_osd.h \ $(srcdir)/../include/complib/cl_types.h \ $(srcdir)/../include/complib/cl_types_osd.h \ - $(srcdir)/../include/complib/cl_vector.h \ - $(srcdir)/../include/complib/cl_waitobj.h + $(srcdir)/../include/complib/cl_vector.h # headers are distributed as part of the include dir EXTRA_DIST = $(srcdir)/libosmcomp.spec.in $(srcdir)/libosmcomp.map \ diff --git a/osm/complib/cl_complib.c b/osm/complib/cl_complib.c index 9eb475b..5a3ab62 100644 --- a/osm/complib/cl_complib.c +++ b/osm/complib/cl_complib.c @@ -40,10 +40,7 @@ #endif /* HAVE_CONFIG_H */ #include -#include #include -#include -#include #include #include @@ -77,17 +74,6 @@ complib_init(void) cl_status_t status = CL_SUCCESS; /* - * System Helper Init. 
- */ - status = __cl_user_syshelper_init(); - if( status != CL_SUCCESS ) - { - cl_msg_out( "__init: failed to init syshelper (%s)\n", - CL_STATUS_MSG( status ) ); - exit(1); - } - - /* * Timer Init */ @@ -105,16 +91,12 @@ __attribute (( destructor )) complib_fini(void) { __cl_timer_prov_destroy(); - - __cl_user_syshelper_exit(); } void complib_exit(void) { __cl_timer_prov_destroy(); - - __cl_user_syshelper_exit(); } boolean_t diff --git a/osm/complib/cl_device.c b/osm/complib/cl_device.c deleted file mode 100644 index b9adad1..0000000 --- a/osm/complib/cl_device.c +++ /dev/null @@ -1,145 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -/* - * Standard user mode includes - */ - -#include -#include -#include -#include -#include -#include - -#include -#include -#include - - -cl_status_t -cl_open_device( - IN cl_dev_name_t device_name, - IN cl_dev_handle_t *p_dev_handle ) -{ - /* sanity check */ - if ( p_dev_handle == NULL) - { - return CL_INVALID_PARAMETER; - } - - cl_dbg_out ("cl_open_device: opening device %s\n", - device_name); - - *p_dev_handle = open(device_name, O_RDWR); - - if (*p_dev_handle < 0) - { - *p_dev_handle = 0; - cl_msg_out("cl_open_dev: error opening %s (%s)\n", - device_name, strerror(errno)); - return CL_ERROR; - } - else - { - return CL_SUCCESS; - } -} - -void -cl_close_device( - IN cl_dev_handle_t dev_handle ) -{ - int status = 0; - - status = close (dev_handle); - if (status) - { - cl_msg_out("cl_close_device: error closing device (%s)\n", - strerror(errno)); - } - return; -} - -cl_status_t -cl_ioctl_device( - IN cl_dev_handle_t dev_handle, - IN uint32_t command, - IN void *p_buf, - IN uintn_t buf_size, - OUT uintn_t *p_num_bytes_ret ) -{ - cl_ioctl_info_t ioctl_args; - int retval = 0; - - /* - * Fill up ioctl_args and issue a real ioctl - */ - ioctl_args.command = command; - ioctl_args.p_buf = p_buf; - ioctl_args.buf_size = buf_size; - ioctl_args.num_bytes_ret = 0; /* for now */ - ioctl_args.io_status = CL_SUCCESS; /* lets start here */ - - retval = ioctl(dev_handle, command, &ioctl_args); - - if (retval != 0) - { - cl_msg_out("cl_ioctl_device: error (%s) issuing command (0x%x)\n", - strerror(errno), command); - return CL_ERROR; - } - - /* - * Set the Number of bytes returned from the Kernel. 
- * The driver sets the number of bytes returned in - * ioctl_args.num_bytes_ret - */ - if (p_num_bytes_ret != NULL) - { - *p_num_bytes_ret = ioctl_args.num_bytes_ret; - } - - /* - * Return the status received from the Kernel - */ - - return (ioctl_args.io_status); -} diff --git a/osm/complib/cl_syshelper.c b/osm/complib/cl_syshelper.c deleted file mode 100644 index f95408b..0000000 --- a/osm/complib/cl_syshelper.c +++ /dev/null @@ -1,72 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include - -/* - * Open the system helper device and prepare it for use - */ -cl_status_t -__cl_user_syshelper_init(void) -{ - cl_status_t status = CL_SUCCESS; - - /* Nothing to do. Just a place holder */ - return status; -} - -void -__cl_user_syshelper_exit(void) -{ - /* Nothing to do. Just a place holder */ - - return; -} diff --git a/osm/complib/cl_waitobj.c b/osm/complib/cl_waitobj.c deleted file mode 100644 index 00cc8e7..0000000 --- a/osm/complib/cl_waitobj.c +++ /dev/null @@ -1,215 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - -/* - * Abstract: - * This module defines - * - * Environment: - * Linux User Mode - * - * $Revision: 1.5 $ - */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include - - -cl_status_t -cl_create_wait_object( - IN const boolean_t auto_reset, - OUT cl_wait_obj_handle_t *p_wait_obj_handle ) -{ - cl_status_t status = CL_SUCCESS; - uintn_t command; - cl_create_wait_obj_params_t ioctl_buf; - uintn_t num_bytes_returned; - cl_dev_handle_t device_handle; - - /* - * First open sysdev to get a new FD - */ - status = cl_open_device( SYSHELP_DEVICE_NAME, &device_handle ); - if ( status != CL_SUCCESS ) - { - cl_msg_out( "Failed to open device %s, status (%s)\n", - SYSHELP_DEVICE_NAME, CL_STATUS_MSG(status) ); - return status; - } - - command = CREATE_WAIT_OBJ; - ioctl_buf.auto_reset = auto_reset; - - status = cl_ioctl_device(device_handle, - command, - &ioctl_buf, /* result */ - sizeof (cl_create_wait_obj_params_t), - &num_bytes_returned); - - status = (CL_SUCCESS != status) ? 
status : ioctl_buf.status; - - if ( status != CL_SUCCESS) - { - cl_msg_out("create_wait_object: failed to create waitobject (%s)\n", - CL_STATUS_MSG(status) ); - return status; - } - - *p_wait_obj_handle = (cl_wait_obj_handle_t)device_handle; - - return status; -} - -cl_status_t -cl_wait_on_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle, - IN uint32_t wait_u_sec ) -{ - cl_status_t status = CL_SUCCESS; - cl_wait_ioctl_params_t ioctl_params; - uintn_t command; - cl_dev_handle_t device_handle = (cl_dev_handle_t)wait_obj_handle; - - /* fill out the ioctl parameters */ - ioctl_params.wait_u_sec = wait_u_sec; - ioctl_params.wait_status = CL_SUCCESS; - - command = WAITON_WAIT_OBJ; - - status = cl_ioctl_device(device_handle, - command, - &ioctl_params, - sizeof(cl_wait_ioctl_params_t), - NULL ); - - if ( status != CL_SUCCESS) - { - cl_msg_out("wait_on_wait_object: cl_ioctl_device failed (%s)\n", - CL_STATUS_MSG(status) ); - return status; - } - - /* - * if the ioctl returned successfully, return the status - * of the wait - */ - - return (ioctl_params.wait_status); -} - -cl_status_t -cl_signal_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ) -{ - uintn_t command; - cl_status_t status = CL_SUCCESS; - void *p_in_buf; - cl_dev_handle_t device_handle = (cl_dev_handle_t)wait_obj_handle; - - p_in_buf = wait_obj_handle; - - command = TRIGGER_WAIT_OBJ; - - status = cl_ioctl_device(device_handle, - command, - &p_in_buf, - sizeof(p_in_buf), - NULL); - - if ( status != CL_SUCCESS) - { - cl_msg_out("trigger_wait_object: cl_ioctl_device failed (%s)\n", - CL_STATUS_MSG(status) ); - return status; - } - - return status; -} - -cl_status_t -cl_destroy_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ) -{ - cl_status_t status = CL_SUCCESS; - cl_dev_handle_t device_handle = (cl_dev_handle_t)wait_obj_handle; - - cl_close_device( device_handle ); - - return status; -} - -cl_status_t -cl_clear_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ) -{ - uintn_t command; - 
cl_status_t status = CL_SUCCESS; - void *p_in_buf; - cl_dev_handle_t device_handle = (cl_dev_handle_t)wait_obj_handle; - - p_in_buf = wait_obj_handle; - - command = RESET_WAIT_OBJ; - - status = cl_ioctl_device(device_handle, - command, - &p_in_buf, - sizeof(p_in_buf), - NULL); - - if ( status != CL_SUCCESS) - { - cl_msg_out("clear_wait_object: cl_ioctl_device failed (%s)\n", - CL_STATUS_MSG(status) ); - return status; - } - - return status; -} diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am index 68e60bf..f487b30 100644 --- a/osm/include/Makefile.am +++ b/osm/include/Makefile.am @@ -135,8 +135,6 @@ EXTRA_DIST = \ $(srcdir)/complib/cl_qpool.h \ $(srcdir)/complib/cl_qlist.h \ $(srcdir)/complib/cl_reqmgr.h \ - $(srcdir)/complib/cl_waitobj.h \ - $(srcdir)/complib/cl_device.h \ $(srcdir)/complib/cl_vector.h \ $(srcdir)/complib/cl_byteswap_osd.h \ $(srcdir)/complib/cl_qlockpool.h \ @@ -149,7 +147,6 @@ EXTRA_DIST = \ $(srcdir)/complib/cl_list.h \ $(srcdir)/complib/cl_atomic.h \ $(srcdir)/complib/cl_map.h \ - $(srcdir)/complib/cl_syshelper.h \ $(srcdir)/complib/cl_timer.h \ $(srcdir)/complib/cl_event.h \ $(srcdir)/complib/cl_log.h \ diff --git a/osm/include/complib/cl_device.h b/osm/include/complib/cl_device.h deleted file mode 100644 index 5cd44ef..0000000 --- a/osm/include/complib/cl_device.h +++ /dev/null @@ -1,712 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - - - -/* - * Abstract: - * This module defines the data structure and APIs for the kernel mode - * component of the device framework. 
- * - * Environment: - * Linux Kernel Mode - * - * $Revision: 1.5 $ - */ - - -#ifndef _CL_DEVICE_H_ -#define _CL_DEVICE_H_ - - -#include -#include -#include -#include -#include -#ifdef __KERNEL__ -#include -#endif - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS - - -/****h* Component Library/Device Framework -* NAME -* Device Framework -* -* DESCRIPTION -* The device framework provides functionality exchanging information between -* kernel mode and user mode components. -* -* In kernel mode, the device framework provides functionality for creating -* devices and providing various entry points that are called when an -* application using the user-mode device framework accesses the device. -* -* In user mode, the device framework provides applications with a simplified -* device usage model, allowing access to a device created using the kernel -* mode device framework. -*********/ - - -/* - * Generic way to define a IOCTL_CMD - */ - -#define IOCTL_CMD(dev_id, command) _IO((dev_id), (command)) - - -/****d* Component Library: Device Framework/cl_dev_name_t -* NAME -* cl_dev_name_t -* -* DESCRIPTION -* cl_dev_name defines the name of the system device being created. -* -* SYNOPSIS -*/ -typedef char *cl_dev_name_t; -/* -* NOTES -* In Linux this string turns is similar to "/dev/iba0". -*********/ - - -/****s* Component Library: Device Framework/cl_ioctl_info_t -* NAME -* cl_ioctl_info_t -* -* PURPOSE -* Defines the command and the input parameters for an ioctl passed -* in from user mode. 
-* -* SYNOPSIS -*/ -typedef struct _cl_ioctl_info -{ - uintn_t command; /* IOCTL Command */ - void *p_buf; /* Pointer to the input buffer */ - uintn_t buf_size; /* Size of the input buffer */ - uintn_t num_bytes_ret; /* Bytes returned by the ioctl */ - cl_status_t io_status; /* Status of the IOCTL */ - -} cl_ioctl_info_t; -/**********/ - - -#ifdef __KERNEL__ - -/* Linux Kernel Mode */ - -/****d* Component Library: Device Framework/cl_dev_handle_t (Kernel Mode) -* NAME -* cl_dev_handle_t (Kernel Mode) -* -* DESCRIPTION -* Handle to an device framework created device object. -* -* SYNOPSIS -*/ -typedef void *cl_dev_handle_t; -/*********/ - - -/****d* Component Library: Device Framework/cl_pfn_dev_open_t -* NAME -* cl_pfn_dev_open_t -* -* DESCRIPTION -* Prototype of the driver exported "open" function. This function -* is called when a user mode application opens the device -* -* SYNOPSIS -*/ -typedef cl_status_t -(*cl_pfn_dev_open_t)( - IN void *p_device_context, - OUT void **pp_open_context ); - -/* -* PARAMETERS -* p_device_context -* [in] device specific context passed into the open function. This -* is defined when the device is first created. -* -* pp_open_context -* [out] context specific to this open. This is set by the open call -* and is passed into the read, write, mmap, ioctl and close calls. -* -* RETURN VALUE -* This function returns a status of type cl_status_t -* TBD -* -* NOTES -* -* SEE ALSO -* Device Framework, cl_pfn_dev_close_t, cl_pfn_dev_ioctl_t, cl_pfn_dev_mmap_t -*************/ - - -/****d* Component Library: Device Framework/cl_pfn_dev_close_t -* NAME -* cl_pfn_dev_close_t -* -* DESCRIPTION -* Prototype of the driver exported "close" function. This function -* is called when a user mode application closes the device. -* -* SYNOPSIS -*/ -typedef cl_status_t -(*cl_pfn_dev_close_t)( - IN void *p_device_context, - IN void *p_open_context ); - -/* -* PARAMETERS -* p_device_context -* [in] device specific context passed into the close function. 
This -* is defined when the device is first created. -* -* p_open_context -* [in] context specific to this open. This is set by the open call. -* -* RETURN VALUE -* This function returns a status of type cl_status_t -* TBD -* -* NOTES -* -* SEE ALSO -* Device Framework, cl_pfn_dev_open_t, cl_pfn_dev_ioctl_t, cl_pfn_dev_mmap_t -*************/ - - -/****d* Component Library: Device Framework/cl_pfn_dev_ioctl_t -* NAME -* cl_pfn_dev_ioctl_t -* -* DESCRIPTION -* Prototype of the driver exported "ioctl" function. This function -* is called when a user mode application issues and IO control to -* the device. -* -* SYNOPSIS -*/ -typedef void -(*cl_pfn_dev_ioctl_t)( - IN void *p_device_context, - IN void *p_open_context, - IN OUT cl_ioctl_info_t *p_ioctl_info ); -/* -* PARAMETERS -* p_device_context -* [in] device specific context passed into the open function. This -* is created when the device is first created. -* -* p_open_context -* [in] context specific to this open. This is set by the open call. -* -* p_ioctl_info -* [in/out] On input contains the input parameters such as the command, -* the input buffer and it's size. On output, contains the output -* buffer and the number of bytes returned. -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* The IOCTL completion status is returned in the status field of the -* p_ioctl_info structure. -* -* If set to CL_SUCCESS, the command completed successfully. The results -* are returned in the same buffer (p_ioclt_info->p_in_buf) and num_ret_bytes -* are copied by the device framework into the output buffer and returned -* back to user mode. -* -* If set to CL_PENDING, the command did not complete (further processing -* required by the kernel handler). If the user application does not specify -* an OS wait object (see cl_ioctl_info_t), the device framework will suspend -* the execution of the current thread (put it to sleep). If a wait object is -* provided, the device framework will return immediately. 
The user mode -* application must check the status of the command and get results using -* get_ioctl_status -* -* Other failure status values will cause the device framework to perform -* all necessary cleanup for the IOCTL before returning to the user. -* -* The mechanism to support pending ioclts is currently not supported. -* -* SEE ALSO -* Device Framework, cl_pfn_dev_open_t, cl_pfn_dev_close_t, cl_pfn_dev_mmap_t -************/ - -/****d* Component Library: Device Framework/cl_pfn_dev_cancel_ioctl_t -* NAME -* cl_pfn_dev_cancel_ioctl_t -* -* DESCRIPTION -* Prototype of the driver exported "ioctl" function. This function -* is called when a blocked ioctl is interrupted due to a signal -* -* SYNOPSIS -*/ -typedef void -(*cl_pfn_dev_cancel_ioctl_t)( - IN void *p_device_context, - IN void *p_open_context, - IN cl_ioctl_info_t *p_ioctl_info ); -/* -* PARAMETERS -* p_device_context -* [in] device specific context passed into the open function. This -* is created when the device is first created. -* -* p_open_context -* [in] context specific to this open. This is set by the open call. -* -* p_ioctl_info -* [in] contains the kernel ioctl buffer -* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* -* SEE ALSO -* Device Framework, cl_pfn_dev_open_t, cl_pfn_dev_close_t, cl_pfn_dev_mmap_t -************/ - -/****d* Component Library: Device Framework/cl_pfn_dev_mmap_t -* NAME -* cl_pfn_dev_mmap_t -* -* DESCRIPTION -* Prototype of the driver exported "mmap" function. This function -* is called when a user mode application calls mmap on this device. -* -* SYNOPSIS -*/ -typedef cl_status_t -(*cl_pfn_dev_mmap_t)( - IN void *p_device_context, - IN void *p_open_context, - IN struct vm_area_struct *p_vma ); -/* -* PARAMETERS -* p_device_context -* [in] device specific context passed into the open function. This -* is created when the device is first created. -* -* pp_open_context -* [out] context specific to this open. This is set by the open call. 
-* -* RETURN VALUE -* CL_SUCCESS if the operation was successful. -* -* CL_ERROR if the operation failed. -* -* SEE ALSO -* Device Framework, cl_pfn_dev_open_t, cl_pfn_dev_close_t, cl_pfn_dev_ioctl_t -************/ - - -/****s* Component Library: Device Framework/cl_dev_info_t -* NAME -* cl_dev_info_t -* -* DESCRIPTION -* Provides information about for creation of a system device. -* -* SYNOPSIS -*/ -typedef struct _cl_dev_info -{ - cl_dev_name_t name; /* name of the device to create */ - void *p_device_context; /* device context */ - cl_pfn_dev_open_t pfn_open; /* open function */ - cl_pfn_dev_close_t pfn_close; /* close function */ - cl_pfn_dev_ioctl_t pfn_ioctl; /* ioctl function */ - cl_pfn_dev_cancel_ioctl_t pfn_cancel_ioctl; /* cancel ioctl function */ - cl_pfn_dev_mmap_t pfn_mmap; /* mmap function */ - uint32_t max_use_count; /* maximum usage count */ - -} cl_dev_info_t; -/* -* FIELDS -* name -* Name of the device to create. -* -* p_device_context -* Device context. -* -* pfn_open -* open function. -* -* pfn_close -* close function. -* -* pfn_ioctl -* ioctl function. -* -* pfn_mmap -* mmap function. -* -* max_use_count -* maximum usage count. -* -* NOTES -* The consumer of the device framework defines the name of the device, -* a device context and various handlers and registers with the device -* framework. -* -* SEE ALSO -* Device Framework, cl_pfn_dev_open_t, cl_pfn_dev_close_t, -* cl_pfn_dev_ioctl_t, cl_pfn_dev_mmap_t -**********/ - - -/****f* Component Library: Device Framework/cl_dev_create -* NAME -* cl_dev_create -* -* DESCRIPTION -* This function creates a device accessible from a user mode application. -* The device (a character device) can be accessed by opening it. -* -* SYNOPSIS -*/ -cl_status_t -cl_dev_create( - IN cl_dev_info_t *p_dev_info, - OUT cl_dev_handle_t *ph_dev ); -/* -* PARAMETERS -* p_dev_info -* [in] Pointer to the structure containing information about -* the system device that needs to be created. 
-* -* ph_dev -* [out] Handle to the created system device. -* -* RETURN VALUE -* CL_SUCCESS -* The device is created successfully. -* -* CL_INVALID_PARAMETER -* An input parameter was invalid. -* -* CL_ERROR -* The function failed to create a device. -* -* SEE ALSO -* Device Framework, cl_delete_device -*********/ - -/****d* Component Library: Device Framework/cl_delete_device -* NAME -* cl_delete_device -* -* DESCRIPTION -* Delete a system device. -* -* SYNOPSIS -*/ -void -cl_dev_destroy( - IN cl_dev_handle_t h_dev ); -/* -* PARAMETERS -* h_dev -* [in] handle to the system device. This is returned by cl_dev_create. -* -* RETURN VALUE -* This function does not return a value. -* -* SEE ALSO -* Device Framework, cl_dev_create -*********/ - - -/****d* Component Library: Device Framework/cl_complete_io -* NAME -* cl_complete_io -* -* DESCRIPTION -* This function is used by the driver ioctl handler to indicate that a -* pending ioctl command has completed. The device framework uses this -* "notification" to wake up the blocked user mode thread and complete -* the ioctl. The user mode thread is blocked when the driver ioctl handler -* returns CL_PENDING (cannot complete the ioclt command immediately). -* -* SYNOPSIS -*/ -cl_status_t -cl_complete_io( - IN cl_ioctl_info_t *p_ioctl_info ); -/* -* PARAMETERS -* p_ioctl_info -* [in] Pointer to the kernel ioctl info structure that was passed in -* by the device framework into the ioctl handler. -* -* RETURN VALUE -* CL_SUCCESS -* Completed pending ioctl operation from a previous ioctl command -* successfully. -* -* CL_INVALID_PARAMETER -* The p_ioctl_info parameter was not valid. -* -* NOTES -* When the kernel ioctl handler returns CL_PENDING for a command, the -* user mode application is blocked (if the OS wait object is NULL) or -* returned with CL_PENDING (if a wait object is specified). 
If a wait -* object is specified, the wait object is now signaled and the user mode -* application needs to call get_ioctl_status() to get results of the -* ioctl command (This isn't supported yet). If the OS wait object isn't -* specified, and the user mode application was blocked by the driver -* framework, it is unblocked now and returned. -* -* SEE ALSO -* Device Framework, cl_dev_create, cl_delete_device. -*********/ - - -/****f* Component Library: Device Framework/cl_dev_init -* NAME -* cl_dev_init -* -* DESCRIPTION -* Initialize internal data structures for creating and managing system -* devices and this framework. -* -* SYNOPSIS -*/ -cl_status_t -__cl_dev_frmwk_init(void); -/* -* RETURN VALUE -* CL_SUCCESS -* If the system device framework was initialized successfully. -* -* CL_INSUFFICIENT_RESOURCES -* Failed to allocate/acquire some system resource. -* -* NOTES -* This function is called by drivers needing to use the driver framework. -* For instance, the driver might need to support IOCTL calls. -* -* SEE ALSO -* Device Framework, cl_dev_destroy -*********/ - -/****f* Component Library: Device Framework/cl_dev_destroy -* NAME -* cl_dev_destroy -* -* DESCRIPTION -* Cleanup all internal data structures, created in system_dev_init, -* to support system devices and this framework -* -* SYNOPSIS -*/ -void -__cl_dev_frmwk_destroy(void); -/* -* RETURN VALUE -* This function does not return a value. -* -* NOTES -* This function is called by a driver that needed device framework support -* and called system_dev_init. This function cleans up everything allocated -* in system_dev_init. 
-* -* SEE ALSO -* Device Framework, cl_dev_init -*********/ - -cl_status_t -cl_get_open_context( - IN int fd, - OUT void** p_open_context, - OUT cl_dev_info_t** p_dev_info ); - - -#else /* __KERNEL__ */ - -/* Linux User Mode */ - - -/****d* Component Library: Device Framework/cl_dev_handle_t (User Mode) -* NAME -* cl_dev_handle_t (User Mode) -* -* DESCRIPTION -* Handle to an device framework created device object. -* -* SYNOPSIS -*/ -typedef intn_t cl_dev_handle_t; -/*********/ - - -/****f* Component Library: Device Framework/cl_open_device -* NAME -* cl_open_device -* -* DESCRIPTION -* Opens a system device. A user mode application needs to open a device -* and use the returned device handle to use the device. -* -* The device must be closed when the application has finished using the -* device. -* -* SYNOPSIS -*/ -cl_status_t -cl_open_device( - IN cl_dev_name_t device_name, - OUT cl_dev_handle_t *ph_dev ); - -/* -* PARAMETERS -* device_name -* [in] Name of the device to open. This is the same name that was used to -* create the kernel mode device. -* -* ph_dev -* [out] pointer to the location that holds the device handle. This handle -* is used to reference this device in other calls like cl_close_device etc. -* -* RETURN VALUE -* CL_SUCCESS -* The open call succeeded. -* -* CL_INVALID_PARAMETER -* The consumer passed in an invalid parameter. -* -* CL_ERROR -* The device does not exist, open failed. -* -* NOTES -* A kernel driver must use the kernel mode device framework to create this -* device before it can be opened. -* -* SEE ALSO -* Device Framework, cl_close_device, cl_iocl_device. -********/ - - -/****f* Component Library: Device Framework/cl_close_device -* NAME -* cl_close_device -* -* DESCRIPTION -* Closes an open system device. -* -* SYNOPSIS -*/ -void -cl_close_device( - IN cl_dev_handle_t h_dev ); - -/* -* PARAMETERS -* h_dev -* [in] Handle to an open device returned by a previous call -* to cl_open_device. 
-* -* RETURN VALUE -* This function does not return a value. -* -* SEE ALSO -* cl_open_device, cl_ioctl_device -********/ - - -/****f* Component Library: Device Framework/cl_ioctl_device -* NAME -* cl_ioctl_device -* -* DESCRIPTION -* Issue an io control (ioctl) operation to a device. The device must -* be opened before this can be done. -* -* SYNOPSIS -*/ -cl_status_t -cl_ioctl_device( - IN cl_dev_handle_t dev_handle, - IN uint32_t command, - IN void *p_buf, - IN uintn_t buf_size, - OUT uintn_t *p_num_bytes_ret ); -/* -* PARAMETERS -* dev_handle -* [in] Handle to an open device. Returned by cl_open_device. -* command -* [in] The ioctl command for the kernel handler. -* p_buf -* [in] pointer to the input buffer to pass arguments to the kernel -* handler. The same buffer will hold the results too. -* buf_size -* [in] size of the previous argument (p_buf). -* p_num_bytes_ret -* [in] actual number of bytes returned in the output buffer by the -* kernel handler. -* -* RETURN VALUE -* The ioctl status returned by the kernel handler. -* -* SEE ALSO -* Device Framework, cl_open_device, cl_close_device. -********/ - -#endif // __KERNEL__ - -END_C_DECLS - -#endif // _CL_DEVICE_H_ diff --git a/osm/include/complib/cl_syshelper.h b/osm/include/complib/cl_syshelper.h deleted file mode 100644 index 36ed206..0000000 --- a/osm/include/complib/cl_syshelper.h +++ /dev/null @@ -1,238 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - - - -/* - * Abstract: - * This header file defines data structures and APIs for the system helper - * module of the component library. - * - * Environment: - * Linux Kernel and User Mode. - * - * $Revision: 1.7 $ - */ - - -#ifndef _CL_SYSHELPER_H_ -#define _CL_SYSHELPER_H_ - -/****h* Component Library/System Helper -* NAME -* System Helper -* -* DESCRIPTION -* Provides ioctl support and handle validation for wait objects. 
-* -********/ - -#ifdef __KERNEL__ - -#include -#include - -#endif //__KERNEL__ - -#include -#include -#include -#include -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS - -#define SYSHELP_DEVICE_NAME "/dev/cl_dev" -#define SYSDEV_KEY '#' - -#ifdef __KERNEL__ - -/****f* Component Library: System Helper/__cl_syshelper_init -* NAME -* __cl_syshelper_init -* -* DESCRIPTION -* Initializes the system helper data structures and prepares it -* for use. -* -* SYNOPSIS -*/ -cl_status_t -__cl_syshelper_init(void); - -/* -* PARAMETERS -* None. -* -* RETURN VALUES -* CL_SUCCESS -* The initialization completed successfully. -* CL_ERROR -* Could not create the device. -* -* NOTES -* -* SEE ALSO -* __cl_syshelper_exit -********/ - -/****f* Component Library: System Helper/__cl_syshelper_exit -* NAME -* __cl_syshelper_exit -* -* DESCRIPTION -* Releases the resources used by syshelper and destroys the device. -* -* SYNOPSIS -*/ -void -__cl_syshelper_exit(void); - -/* -* PARAMETERS -* None -* -* RETURN VALUES -* None. -* -* NOTES -* -* SEE ALSO -* __cl_syshelper_init -*********/ - -#else // __KERNEL__ - -/* - * User mode only - */ - -/****f* Component Library: System Helper/__cl_user_syshelper_init -* NAME -* __cl_user_syshelper_init -* -* DESCRIPTION -* Initialize the system helper in user mode. -* -* SYNOPSIS -*/ -cl_status_t -__cl_user_syshelper_init(void); -/* -* PARAMETERS -* None. -* -* RETURN VALUES -* CL_SUCCESS -* The initialization completed successfully. -* -* NOTES -* -* SEE ALSO -*****/ - -/****f* Component Library: System Helper/__cl_user_syshelper_exit -* NAME -* __cl_user_syshelper_exit -* -* DESCRIPTION -* Cleanup the system helper in user mode. -* -* SYNOPSIS -*/ -void -__cl_user_syshelper_exit(void); -/* -* PARAMETERS -* None. -* -* RETURN VALUES -* None. 
-* -* NOTES -* -* SEE ALSO -******/ - -#endif // __KERNEL__ - -/* - * Shared between user and kernel mode. - */ - -/****d* Component Library: System Helper/cl_syshelper_ops_t -* NAME -* syshelper_ops_t -* -* DESCRIPTION -* -* SYNOPSIS -*/ -typedef enum cl_syshelper_ops -{ - create_wait_obj = 1, - waiton_wait_obj, - trigger_wait_obj, - reset_wait_obj, - syshelp_ioctl_max /* always at the end of the list */ - -} cl_syshelper_ops_t; -/**********/ - -/* - * Various Opration Allowable on the System Helper - */ -#define CREATE_WAIT_OBJ \ - IOCTL_CMD(SYSDEV_KEY, create_wait_obj) -#define WAITON_WAIT_OBJ \ - IOCTL_CMD(SYSDEV_KEY, waiton_wait_obj) -#define TRIGGER_WAIT_OBJ \ - IOCTL_CMD(SYSDEV_KEY, trigger_wait_obj) -#define RESET_WAIT_OBJ \ - IOCTL_CMD(SYSDEV_KEY, reset_wait_obj) -END_C_DECLS - -#endif //_CL_SYSHELPER_H_ diff --git a/osm/include/complib/cl_waitobj.h b/osm/include/complib/cl_waitobj.h deleted file mode 100644 index 58dd938..0000000 --- a/osm/include/complib/cl_waitobj.h +++ /dev/null @@ -1,369 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. 
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - - - -/* - * Abstract: - * This header file defines the data structures and APIs for implementing - * wait objects. - * - * Environment: - * Linux Kernel Mode - * - * $Revision: 1.7 $ - */ - -#ifndef _CL_WAITOBJ_H_ -#define _CL_WAITOBJ_H_ - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS - -/****h* Component Library/Wait Objects -* NAME -* Wait Object -* -* DESCRIPTION -* The Wait Object provides the capability for a user mode process to -* create and wait on a kernel event. An action on a wait object can -* be done from both a user mode thread as well as a kernel thread. -* -******/ - -#include - -/****d* Component Library: Wait Objects/cl_wait_obj_handle_t -* NAME -* cl_wait_obj_handle_t -* -* DESCRIPTION -* Defines the handle for an OS wait object. 
-* -* SYNOPSIS -*/ -typedef void *cl_wait_obj_handle_t; -/* -* -******/ - -/****i* Component Library: Wait Objects/cl_wait_ioctl_params_t -* NAME -* cl_wait_ioctl_params -* -* DESCRIPTION -* Defines parameters for the ioctl call to implement wait objects -* -* SYNOPSIS -*/ -typedef struct cl_wait_ioctl_params -{ - uint32_t wait_u_sec; - cl_status_t wait_status; - -} cl_wait_ioctl_params_t; -/* -* -******/ - -/****i* Component Library: Wait Objects/cl_create_wait_obj_params_t -* NAME -* cl_create_wait_obj_params -* -* DESCRIPTION -* Defines parameters for the ioctl call to create wait objects -* -* SYNOPSIS -*/ -typedef struct cl_create_wait_obj_params -{ - // Input - boolean_t auto_reset; - // Output - cl_status_t status; - -} cl_create_wait_obj_params_t; -/* -* -******/ - - -#if defined(__KERNEL__) -/* - * Kernel Mode Support for Wait Objects - */ - -/* Internal helper functions for wait object */ - -cl_status_t -__cl_create_wait_object( - IN boolean_t auto_reset, - OUT cl_wait_obj_handle_t *p_wait_obj_handle); - -cl_status_t -__cl_wait_on_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle, - IN uint32_t wait_u_sec ); - -cl_status_t -__cl_signal_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle); - -cl_status_t -__cl_clear_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle); - -cl_status_t -__cl_destroy_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle); - - -/****i* Component Library: Wait Objects/cl_get_kernel_wait_object -* NAME -* cl_get_kernel_wait_object -* -* DESCRIPTION -* cl_get_kernel_wait_object -- Validates the wait object handle and -* returns the kernel wait object handle. -* -* SYNOPSIS -*/ -cl_wait_obj_handle_t -cl_get_kernel_wait_object( - IN cl_wait_obj_handle_t user_mode_handle ); -/* -* PARAMETERS -* user_mode_handle -* A handle to the wait object passed from user mode. -* -* RETURN VALUES -* On successful validation, returns the kernel wait object handle. 
-* Kernel threads use this handle to do any appropriate action -* on this wait object. -* On failure, returns NULL. -* -* NOTES -* This API is used only in the kernel. -* -* SEE ALSO -* cl_create_wait_object, cl_destroy_wait_object, cl_wait_on_wait_object, -* cl_signal_wait_object, cl_clear_wait_object. -* -******/ - -/******/ - -#endif //__KERNEL__ - -/* - * Shared between user and kernel mode - */ - -/****f* Component Library: Wait Objects/cl_create_wait_object -* NAME -* cl_create_wait_object -* -* DESCRIPTION -* cl_create_wait_object -- Creates a wait object. -* -* SYNOPSIS -*/ -cl_status_t -cl_create_wait_object( - IN const boolean_t auto_reset, - OUT cl_wait_obj_handle_t *p_wait_obj_handle ); -/* -* PARAMETERS -* auto_reset -* Specifies whether the signaled state should be reset automatically -* or manually. If set to TRUE, the state will be reset automatically. -* p_wait_obj_handle -* On successful creation, returns the wait object handle. -* -* RETURN VALUES -* CL_SUCCESS -* The wait object was created successfully. -* CL_ERROR -* The wait object creation failed. -* NOTES -* Used in both kernel mode as well as user mode. -* -* SEE ALSO -* cl_destroy_wait_object, cl_wait_on_wait_object, -* cl_signal_wait_object, cl_clear_wait_object. -* -******/ - -/****f* Component Library: Wait Objects/cl_destroy_wait_object -* NAME -* cl_destroy_wait_object -* -* DESCRIPTION -* cl_destroy_wait_object -- Destroys a wait object. -* -* SYNOPSIS -*/ -cl_status_t -cl_destroy_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ); -/* -* PARAMETERS -* wait_obj_handle -* A handle to the wait object that needs to be destroyed. -* -* RETURN VALUES -* CL_SUCCESS -* The wait object handle is destroyed. -* NOTES -* Used in both kernel mode as well as user mode. -* -* SEE ALSO -* cl_create_wait_object, cl_wait_on_wait_object, -* cl_signal_wait_object, cl_clear_wait_object. 
-* -*********/ - -/****f* Component Library: Wait Objects/cl_wait_on_wait_object -* NAME -* cl_wait_on_wait_object -* DESCRIPTION -* cl_wait_on_wait_object -- Wait on this wait object until signalled -* or timed out. -* -* SYNOPSIS -*/ -cl_status_t -cl_wait_on_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle, - IN uint32_t wait_u_sec ); -/* -* PARAMETERS -* wait_obj_handle -* A handle to the wait object that the thread needs to be wait on. -* wait_u_sec -* The number of micro seconds to wait before timing out. -* -* RETURN VALUES -* CL_SUCCESS -* The wait completed successfully and the event is signalled. -* CL_ERROR -* Some error happened during the wait. -* CL_NOT_DONE -* The wait got interrupted due to some signal. -* CL_TIMEOUT -* The wait timed out. -* -* NOTES -* Used in both kernel mode as well as user mode. -* -* SEE ALSO -* cl_create_wait_object, cl_destroy_wait_object, -* cl_signal_wait_object, cl_clear_wait_object. -* -*********/ - -/****f* Component Library: Wait Objects/cl_signal_wait_object -* NAME -* cl_signal_wait_object -* -* DESCRIPTION -* cl_signal_wait_object -- Signal the wait object. The thread -* that is waiting on this will be woken up. -* -* SYNOPSIS -*/ -cl_status_t -cl_signal_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ); -/* -* PARAMETERS -* wait_obj_handle -* A handle to the wait object that needs to be signaled. -* -* RETURN VALUES -* CL_SUCCESS -* The wait object is successfully signaled. -* -* NOTES -* Used in both kernel mode as well as user mode. -* -* SEE ALSO -* cl_create_wait_object, cl_destroy_wait_object, -* cl_wait_on_wait_object, cl_clear_wait_object. -* -*********/ - -/****f* Component Library: Wait Objects/cl_clear_wait_object -* NAME -* cl_clear_wait_object -* -* DESCRIPTION -* cl_clear_wait_object -- Clear the signaled state of a wait object -* and reset it. 
-* -* SYNOPSIS -*/ -cl_status_t -cl_clear_wait_object( - IN cl_wait_obj_handle_t wait_obj_handle ); -/* -* PARAMETERS -* wait_obj_handle -* A handle to the wait object whose state needs to be reset. -* -* RETURN VALUES -* CL_SUCCESS -* The wait object is reset successfully. -* -* NOTES -* Used in both kernel mode as well as user mode. -* -* SEE ALSO -* cl_create_wait_object, cl_destroy_wait_object, -* cl_wait_on_wait_object, cl_signal_wait_object. -* -*********/ - -END_C_DECLS - -#endif // _CL_WAITOBJ_H_ From sashak at voltaire.com Mon Feb 27 11:37:58 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Feb 2006 21:37:58 +0200 Subject: [openib-general] [PATCH] osm/complib: clean #ifdef __KERNEL__ Message-ID: <20060227193758.GC10270@sashak.voltaire.com> Hello, This removes obsolete #ifdef __KERNEL__ from complib Sasha. Remove annoying #ifdef __KERNEL__ from complib Signed-off-by: Sasha Khapyorsky --- osm/include/complib/cl_atomic.h | 86 -------------------------- osm/include/complib/cl_debug_osd.h | 109 --------------------------------- osm/include/complib/cl_event_osd.h | 27 -------- osm/include/complib/cl_spinlock_osd.h | 40 ------------ osm/include/complib/cl_thread_osd.h | 39 ------------ osm/include/complib/cl_timer_osd.h | 23 ------- osm/include/complib/cl_types_osd.h | 102 +------------------------------ 7 files changed, 9 insertions(+), 417 deletions(-) diff --git a/osm/include/complib/cl_atomic.h b/osm/include/complib/cl_atomic.h index ad2fda5..176dba8 100644 --- a/osm/include/complib/cl_atomic.h +++ b/osm/include/complib/cl_atomic.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -216,90 +216,6 @@ cl_atomic_sub( * Atomic Operations, cl_atomic_inc, cl_atomic_dec, cl_atomic_add, * cl_atomic_xchg, cl_atomic_comp_xchg *********/ -#ifdef __KERNEL__ - -/****f* Component Library: Atomic Operations/cl_atomic_xchg -* NAME -* cl_atomic_xchg -* -* DESCRIPTION -* The cl_atomic_xchg function atomically sets a value of a -* 32-bit signed integer and returns the initial value. -* -* SYNOPSIS -*/ -int32_t -cl_atomic_xchg( - IN atomic32_t* const p_value, - IN const int32_t new_value ); -/* -* PARAMETERS -* p_value -* [in] Pointer to a 32-bit integer to exchange with new_value. -* -* new_value -* [in] Value to assign. -* -* RETURN VALUE -* Returns the initial value pointed to by p_value. -* -* NOTES -* The provided value is exchanged with new_value and its initial value -* returned in one atomic operation. -* -* cl_atomic_xchg maintains data consistency without requiring additional -* synchronization mechanisms in multi-threaded environments. -* -* SEE ALSO -* Atomic Operations, cl_atomic_inc, cl_atomic_dec, cl_atomic_add, -* cl_atomic_sub, cl_atomic_comp_xchg -*********/ - - -/****f* Component Library: Atomic Operations/cl_atomic_comp_xchg -* NAME -* cl_atomic_comp_xchg -* -* DESCRIPTION -* The cl_atomic_comp_xchg function atomically compares a 32-bit signed -* integer to a desired value, sets that integer to the -* specified value if equal, and returns the initial value. -* -* SYNOPSIS -*/ -int32_t -cl_atomic_comp_xchg( - IN atomic32_t* const p_value, - IN const int32_t compare, - IN const int32_t new_value ); -/* -* PARAMETERS -* p_value -* [in] Pointer to a 32-bit integer to exchange with new_value. -* -* compare -* [in] Value to compare to the value pointed to by p_value. -* -* new_value -* [in] Value to assign if the value pointed to by p_value is equal to -* the value specified by the compare parameter. -* -* RETURN VALUE -* Returns the initial value of the variable pointed to by p_value. 
-* -* NOTES -* The value pointed to by p_value is compared to the value specified by the -* compare parameter. If the two values are equal, the p_value variable is -* set to new_value. The initial value pointed to by p_value is returned. -* -* cl_atomic_comp_xchg maintains data consistency without requiring additional -* synchronization mechanisms in multi-threaded environments. -* -* SEE ALSO -* Atomic Operations, cl_atomic_inc, cl_atomic_dec, cl_atomic_add, -* cl_atomic_sub, cl_atomic_xchg -*********/ -#endif /* __KERNEL__ */ END_C_DECLS diff --git a/osm/include/complib/cl_debug_osd.h b/osm/include/complib/cl_debug_osd.h index 2cd17a0..8f3c2a8 100644 --- a/osm/include/complib/cl_debug_osd.h +++ b/osm/include/complib/cl_debug_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -80,111 +80,6 @@ BEGIN_C_DECLS #define PRIdSIZE_T "d" #endif -#ifdef __KERNEL__ - -#include -#include -#include - - -/* Linux Kernel Mode */ -#if __WORDSIZE == 64 -#define __PRI64_PREFIX "l" -#else -#define __PRI64_PREFIX "L" -#endif - - -#define PRId64 __PRI64_PREFIX"d" -#define PRIo64 __PRI64_PREFIX"o" -#define PRIu64 __PRI64_PREFIX"u" -#define PRIx64 __PRI64_PREFIX"x" - -void cl_printk( char *message, ... ); /* see cl_debug.c */ - - -#ifndef PRINTK_LVL -#define PRINTK_LVL KERN_INFO -#endif - -#if defined (_DEBUG_) - -#if defined (CONFIG_GDB) -#define cl_msg_out printk -#define cl_dbg_out printk -#else -#define cl_msg_out cl_printk -#define cl_dbg_out cl_printk -#endif - -#else /* not _DEBUG_ */ -#define cl_msg_out cl_printk -#define cl_dbg_out foo -#endif /* _DEBUG_ */ - - -/* - * The following macros are used internally by the CL_ENTER, CL_TRACE, - * CL_TRACE_EXIT, and CL_EXIT macros. 
- */ - -#if defined (CONFIG_SMP) - -#define _CL_DBG_ENTER \ - ("~%d:%s%s%s() [\n", smp_processor_id(), __MODULE__, \ - __MOD_DELIMITER__, __func__) - -#define _CL_DBG_EXIT \ - ("~%d:%s%s%s() ]\n", smp_processor_id(), __MODULE__, \ - __MOD_DELIMITER__, __func__) - -#define _CL_DBG_INFO \ - ("~%d:%s%s%s(): ", smp_processor_id(), __MODULE__, \ - __MOD_DELIMITER__, __func__) - -#define _CL_DBG_ERROR \ - ("~%d:%s%s%s() !ERROR!: ", smp_processor_id(), __MODULE__, \ - __MOD_DELIMITER__, __func__) - -#else - -#define _CL_DBG_ENTER \ - ("%s%s%s() [\n", __MODULE__, __MOD_DELIMITER__, __func__) - -#define _CL_DBG_EXIT \ - ("%s%s%s() ]\n", __MODULE__, __MOD_DELIMITER__, __func__) - -#define _CL_DBG_INFO \ - ("%s%s%s(): ", __MODULE__, __MOD_DELIMITER__, __func__) - -#define _CL_DBG_ERROR \ - ("%s%s%s() !ERROR!: ", __MODULE__, __MOD_DELIMITER__, __func__) - -#endif - -#ifdef CONFIG_X86 -#define CL_CHK_STK \ -{ \ - uint32_t __CL_ESP__; \ - __asm__ __volatile__("movl %%esp,%0" : "=r" (__CL_ESP__)); \ - if (((uint32_t)current + sizeof(struct task_struct) + \ - (in_interrupt() ? 300: 1024)) > __CL_ESP__) \ - { \ - cl_msg_out("stack corruption detected!!!\n"); \ - cl_msg_out("::::::::esp(0x%x) top(0x%x)::::::::\n",__CL_ESP__,\ - ((uint32_t)current + sizeof(struct task_struct) + \ - (in_interrupt() ? 300: 1024))); \ - CL_ASSERT (0); \ - } \ -} -#else -#define CL_CHK_STK /* We do not do checks for 64 for now... */ -#endif - - -#else /* __KERNEL__ */ - -/* Linux User Mode */ #include #include @@ -215,8 +110,6 @@ void cl_printk( char *message, ... ); /* #define CL_CHK_STK -#endif /* __KERNEL__ */ - END_C_DECLS #endif /* _CL_DEBUG_OSD_H_ */ diff --git a/osm/include/complib/cl_event_osd.h b/osm/include/complib/cl_event_osd.h index 762cd33..dd497fe 100644 --- a/osm/include/complib/cl_event_osd.h +++ b/osm/include/complib/cl_event_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. 
* Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -64,29 +64,6 @@ BEGIN_C_DECLS -#ifdef __KERNEL__ - -/* Linux Kernel Mode. */ -#include - - -/* - * Linux kernel mode specific data structure for the event object. - * Users should not access these variables directly. - */ -typedef struct _cl_event_t -{ - wait_queue_head_t wait_queue; - boolean_t signaled; - boolean_t manual_reset; - cl_spinlock_t spinlock; - cl_state_t state; - -} cl_event_t; - -#else /* __KERNEL__ */ - -/* Linux User Mode. */ #include /* usr/include */ @@ -104,8 +81,6 @@ typedef struct _cl_event_t } cl_event_t; -#endif /* __KERNEL__ */ - END_C_DECLS #endif /* _CL_EVENT_OSD_H_ */ diff --git a/osm/include/complib/cl_spinlock_osd.h b/osm/include/complib/cl_spinlock_osd.h index 95afd4e..e8c6d14 100644 --- a/osm/include/complib/cl_spinlock_osd.h +++ b/osm/include/complib/cl_spinlock_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -61,42 +61,6 @@ BEGIN_C_DECLS #include - - -#ifdef __KERNEL__ - -/* Linux Kernel Mode. */ -#include -#include -#include - -typedef enum -{ - SPIN_LVL_INVALID, - SPIN_LVL_TASKLET, - SPIN_LVL_INTERRUPT -} cl_spin_level_t; - -/* - * Spinlock object definition. - */ -typedef struct _cl_spinlock_t -{ - spinlock_t lock; - unsigned long flags; - cl_state_t state; - cl_spin_level_t level; -#ifdef _DEBUG_ - boolean_t locked; - struct task_struct *owner; - int cpuid; -#endif - -} cl_spinlock_t; - -#else /* __KERNEL__ */ - -/* Linux User Mode. 
*/ #include /* usr/include/ */ typedef struct _cl_spinlock_t @@ -106,8 +70,6 @@ typedef struct _cl_spinlock_t } cl_spinlock_t; -#endif /* __KERNEL__ */ - END_C_DECLS #endif /* _CL_SPINLOCK_OSD_H_ */ diff --git a/osm/include/complib/cl_thread_osd.h b/osm/include/complib/cl_thread_osd.h index 24ee476..9e596a0 100644 --- a/osm/include/complib/cl_thread_osd.h +++ b/osm/include/complib/cl_thread_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -62,41 +62,6 @@ BEGIN_C_DECLS #include #include - -#ifdef __KERNEL__ - -/* Linux Kernel Mode. */ -#include -#include - -/* bit position 0 for thread wait flag */ -#define THREAD_WAKEUP (0) - -/* Milli Secs per tick, since there are HZ ticks per second */ -#define MISECS_PER_TICK (1000/HZ) - -/* Linux kernel mode thread object structure definition. */ -typedef struct _cl_thread_osd_t -{ - char name[16]; - wait_queue_head_t wqueue; - struct task_struct *task; - - cl_event_t kill_event; - cl_state_t state; - -} cl_thread_osd_t; - - -static inline boolean_t -cl_is_blockable ( void ) -{ - return ( (in_interrupt()) ? FALSE : TRUE ); -} - -#else /* __KERNEL__ */ - -/* Linux User Mode. */ #include /* Linux user mode thread object structure definition. */ @@ -113,8 +78,6 @@ cl_is_blockable ( void ) return TRUE; } -#endif /* __KERNEL__ */ - END_C_DECLS #endif /* _CL_THREAD_OSD_H_ */ diff --git a/osm/include/complib/cl_timer_osd.h b/osm/include/complib/cl_timer_osd.h index 85738fa..abab65d 100644 --- a/osm/include/complib/cl_timer_osd.h +++ b/osm/include/complib/cl_timer_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. 
* Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -63,25 +63,6 @@ BEGIN_C_DECLS -#ifdef __KERNEL__ - -/* Linux Kernel Mode. */ -#include - - -typedef struct _cl_timer_t -{ - struct timer_list timer; - cl_state_t state; - cl_pfn_timer_callback_t pfn_callback; - const void *context; - boolean_t in_timer_cb; - -} cl_timer_t; - -#else /* __KERNEL__ */ - -/* Linux User Mode. */ #include #include @@ -116,8 +97,6 @@ __cl_timer_prov_create( void ); void __cl_timer_prov_destroy( void ); -#endif /* __KERNEL__ */ - END_C_DECLS #endif /* _CL_TIMER_OSD_H_ */ diff --git a/osm/include/complib/cl_types_osd.h b/osm/include/complib/cl_types_osd.h index 84b07d5..a6ab980 100644 --- a/osm/include/complib/cl_types_osd.h +++ b/osm/include/complib/cl_types_osd.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -73,100 +73,6 @@ BEGIN_C_DECLS #define cl_break #endif -#ifdef __KERNEL__ -#include - -/* - * Linux Kernel Mode - */ - -#if defined (_DEBUG_) && defined (CONFIG_X86_REMOTE_DEBUG) -#define CONFIG_GDB -#endif - -#if defined(CONFIG_MODVERSIONS) && !defined(MODVERSIONS) -#define MODVERSIONS /* turn it on */ -#endif - -#ifdef MODVERSIONS -#include -#endif - -#include -#include -#include -#include -#include - -#ifndef LINUX_VERSION_CODE -#include -#endif - -#if defined (_DEBUG_) - -#if defined (CONFIG_GDB) -extern int gdb_initialized; - -#define CL_ASSERT( __exp__ ) \ -{ \ - if( !(__exp__) ) \ - { \ - if (!gdb_initialized) \ - { \ - panic( "Assertion failed: %s, file %s, line %d\n", \ - #__exp__, __FILE__, __LINE__ ); \ - } \ - else \ - { \ - printk( "Assertion failed: %s, file %s, line %d\n", \ - #__exp__, __FILE__, __LINE__ ); \ - printk ("Entering GDB...\n"); \ - cl_break(); \ - } \ - } \ -} -#elif defined (CONFIG_KDB) /* not CONFIG_GDB */ -#include -#undef cl_break -#define cl_break() KDB_ENTER() - -#define CL_ASSERT( __exp__ ) \ -{ \ - if( !(__exp__) ) \ - { \ - printk( "Assertion failed: %s, file %s, line %d\n", \ - #__exp__, __FILE__, __LINE__ ); \ - printk ("Entering KDB...\n"); \ - cl_break(); \ - } \ -} -#else /* not CONFIG_GDB and not CONFIG_KDB */ -#undef cl_break -#define cl_break() - -#define CL_ASSERT( __exp__ ) \ -{ \ - if( !(__exp__) ) \ - { \ - panic( "Assertion failed: %s, file %s, line %d\n", \ - #__exp__, __FILE__, __LINE__ ); \ - } \ -} -#endif /* CONFIG_GDB */ - -#else /* not _DEBUG_ */ -#undef cl_break -#define cl_break() -#define CL_ASSERT( __exp__ ) -#endif /* _DEBUG_ */ - -#define cl_panic panic - -#else /* __KERNEL__ */ - -/* - * Linux User Mode - */ #include #include #include @@ -178,14 +84,12 @@ extern int gdb_initialized; #define CL_ASSERT( __exp__ ) #endif /* _DEBUG_ */ -#endif /* __KERNEL__ */ - /* * Types not explicitly defined are native to the platform. 
*/ typedef unsigned long uintn_t; -typedef long intn_t; -typedef int boolean_t; +typedef long intn_t; +typedef int boolean_t; typedef volatile int32_t atomic32_t; #ifndef NULL From mst at mellanox.co.il Mon Feb 27 11:42:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Feb 2006 21:42:15 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <44034CDF.9030204@ichips.intel.com> References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> Message-ID: <20060227194215.GC20064@mellanox.co.il> Quoting r. Sean Hefty : > >Okay. But, note this affects ACKs as well as ABORTs (with multi-packet > >requests): > > > >Host A sends an RMPP request message to host B with TID=3 > >Host B sends an RMPP request message to host A with TID=3. > >Now if A generates an RMPP response it has TID=3. > > > >If B sends ACK, host A has no idea which transaction is being ACKed. > > Bah... can we distinguish which transaction is being ACKed by the response > bit? Are you talking about checking IB_MGMT_METHOD_RESP? How is this different from what I proposed? Wont this work for Abort/Stop as well? -- Michael S. 
Tsirkin
Staff Engineer, Mellanox Technologies

From mshefty at ichips.intel.com Mon Feb 27 11:47:43 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 27 Feb 2006 11:47:43 -0800
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <20060227194215.GC20064@mellanox.co.il>
References: <200602231814.26918.jackm@mellanox.co.il>
 <200602231825.19333.jackm@mellanox.co.il>
 <43FDF106.8070403@ichips.intel.com>
 <20060225170316.GA15973@mellanox.co.il>
 <44033E57.8090905@ichips.intel.com>
 <20060227181156.GC17414@mellanox.co.il>
 <4403422F.2000408@ichips.intel.com>
 <20060227182746.GB19265@mellanox.co.il>
 <44034CDF.9030204@ichips.intel.com>
 <20060227194215.GC20064@mellanox.co.il>
Message-ID: <4403575F.1010305@ichips.intel.com>

Michael S. Tsirkin wrote:
>>>Host A sends an RMPP request message to host B with TID=3
>>>Host B sends an RMPP request message to host A with TID=3.
>>>Now if A generates an RMPP response it has TID=3.
>>>
>>>If B sends ACK, host A has no idea which transaction is being ACKed.
>>
>>Bah... can we distinguish which transaction is being ACKed by the response
>>bit?
>
> Are you talking about checking IB_MGMT_METHOD_RESP?
>
> How is this different from what I proposed?

Yes - this is what you proposed. I believe that it can work for ACKs since an
ACK must match with a send.

> Won't this work for Abort/Stop as well?

Given the example above, with hosts A and B sending requests, if host B sends an
abort, it's still unknown which transaction is being aborted, since neither the
send nor the receive would have the response bit set.
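
To make the TID=3 example concrete, here is a minimal user-space sketch of matching by response bit. It is not the ib_mad code: `struct rmpp_xfer`, `match_ack()` and `count_by_tid_and_resp()` are hypothetical names, and the sketch assumes an ACK carries the method of the data it acknowledges with the response bit flipped (as the Madeye trace later in this archive suggests: data with method 0x3 is ACKed with method 0x83).

```c
#include <stddef.h>
#include <stdint.h>

#define IB_MGMT_METHOD_RESP 0x80 /* response bit in the MAD method field */

/* Hypothetical record of one outstanding RMPP transfer on a host. */
struct rmpp_xfer {
    uint64_t tid;     /* transaction ID */
    uint8_t  method;  /* method carried by the data MADs of this transfer */
    int      is_send; /* 1 if this host sends the data, 0 if it receives it */
};

/*
 * Find the outstanding send that an incoming ACK refers to.
 * Assumption: the ACK's method is the data method with the response
 * bit flipped, so flip it back before comparing.
 */
static const struct rmpp_xfer *
match_ack(const struct rmpp_xfer *xfers, size_t n,
          uint64_t tid, uint8_t ack_method)
{
    uint8_t want = (uint8_t)((ack_method ^ IB_MGMT_METHOD_RESP) &
                             IB_MGMT_METHOD_RESP);

    for (size_t i = 0; i < n; i++) {
        /* An ACK must match a send, so receives are skipped. */
        if (xfers[i].is_send && xfers[i].tid == tid &&
            (xfers[i].method & IB_MGMT_METHOD_RESP) == want)
            return &xfers[i];
    }
    return NULL;
}

/*
 * Count transfers in either direction that share a TID and a
 * response-bit value. A result above 1 means the response bit
 * alone cannot identify the transaction.
 */
static size_t
count_by_tid_and_resp(const struct rmpp_xfer *xfers, size_t n,
                      uint64_t tid, uint8_t resp_bit)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (xfers[i].tid == tid &&
            (xfers[i].method & IB_MGMT_METHOD_RESP) == resp_bit)
            hits++;
    }
    return hits;
}
```

With a request send and a response send both using TID 3, `match_ack()` picks exactly one of them. With a request send and a request receive both using TID 3, `count_by_tid_and_resp()` returns 2 for a clear response bit, which is the abort ambiguity described above.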
- Sean

From jlentini at netapp.com Mon Feb 27 12:03:49 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 27 Feb 2006 15:03:49 -0500 (EST)
Subject: [openib-general] Re: [PATCH][RFC] CMA automatic port number assignment
In-Reply-To: <43FF5DD9.6070107@ichips.intel.com>
References: <43FF5DD9.6070107@ichips.intel.com>
Message-ID:

On Fri, 24 Feb 2006, Sean Hefty wrote:

> James Lentini wrote:
> > +static u16 cma_generate_ephemeral_port(void)
> > +{
> > +	u16 port;
> > +
> > +	get_random_bytes(&port, sizeof port);
> > +	return cpu_to_be16(port | 1024U);
> > +}
>
> As you mentioned, we should verify that the port number is not in
> use.

I'll work on a new patch for this.

> > static void cma_format_hdr(void *hdr, enum rdma_port_space ps,
> > 			     struct rdma_route *route)
> > {
> > @@ -1371,6 +1379,9 @@ static void cma_format_hdr(void *hdr, en
> > 	src4 = (struct sockaddr_in *) &route->addr.src_addr;
> > 	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
> > +	if (!src4->sin_port)
> > +		src4->sin_port = cma_generate_ephemeral_port();
>
> This only sets the port number in the header. It would make more
> sense to save the port number in the rdma_cm_id's src_addr.

I thought I was saving the port number in the rdma_cm_id's src_addr.
src4 is an alias of the rdma_id_private's rdma_cma_id's rdma_route's
rdma_addr's src_addr field.

How would you do it differently?

From jlentini at netapp.com Mon Feb 27 12:08:28 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 27 Feb 2006 15:08:28 -0500 (EST)
Subject: [openib-general] [PATCH][RFC] CMA automatic port number assignment
In-Reply-To: <43FF79D0.7070900@ichips.intel.com>
References: <1140808742.11764.36.camel@stevo-desktop>
 <43FF79D0.7070900@ichips.intel.com>
Message-ID:

On Fri, 24 Feb 2006, Sean Hefty wrote:

> What's the underlying issue that you're trying to solve?

I prefaced my patch with my expectations for how the RDMA CM API should work.
In particular, I expected the RDMA CM to automatically assign a port number to active connections. Currently the RDMA CM implementation doesn't do that. This is the issue the patch is addressing. From bill.boas at gmail.com Mon Feb 27 12:11:09 2006 From: bill.boas at gmail.com (Bill Boas) Date: Mon, 27 Feb 2006 12:11:09 -0800 Subject: [openib-general] OpenIB/OpenFabrics Booth at IDF - please help man it Message-ID: <19a929370602271211h3e008d44hab69deb7b63dceab@mail.gmail.com> Hello OpenIB and IBTA promoters and SC members, As most of you probably know by now, the OpenIB and IBTA have approved to jointly sponsor a booth at Spring IDF (March 7-9) in San Francisco, CA. The goal of the booth is to raise awareness, promote the activities of, and to actively recruit new members into both organizations. In planning for the booth, there are a number of things I wanted to share with you. We are signed up as a general exhibitor, and with that, we need to staff the booth (we get a number of free passes if you haven't already signed up), create signage and collateral, organize demos, etc. Please take a minute to read the below email and let me know ASAP how you can help! * * *With this booth, we receive the following number of complimentary passes:* - Booth Staff Only Passes [6] (Includes many benefits of Full Forum – see below booth schedule) - IDF Full Forum Passes [2] - Showcase Only (Thursday) Passes (Unlimited Complimentary) - . This pass grants the holder complimentary access to the Technology Showcase on Thursday, March 9th only from 10:00am-12:00pm. *Technology Showcase Show Hours – Booth duty sign-up (Booth # to come)* We will need at least one person from each association to staff the booth at the following times. Please sign up and note if you already have, or will need conference registration. If you need, I will take care of registering you. So the sooner you sign up, the better. 
* * *Tuesday* *March 7* *1 longer shift; or can break up* *Wednesday* *March 8* *2 short shifts* *Thursday* *March 9* *1 short shift* * * * * *Exhibitor Move-in Hours* Monday, March 6 1:00 pm - 8:00 pm *10:00 a.m.-12:00 p.m.* Exhibitor Move-in Tuesday, 10:30 am - 2:00 pm *10am – 12pm *(2 people – 1 from each organization) 1. 2. *12:00 p.m.-2:00 p.m.* *12:00 pm - 2:00 pm *(2 people – 1 from each organization) 1. 2. *TEAR-DOWN (Marta and Megan)* *5:00 -6:00 p.m.-* Let's try to have 3 or 4 people here from 5-8 this first day. (put if you can stay whole time or what hours you're committing to) *Press Sneak Preview 5-6pm and Welcome Reception from 6-8pm* 1. 2. 3. 4. *6:00 – 8:00 p.m.* (Welcome Reception) – same people as above *6:00 pm - 8:00pm (Reception) *(2 people – 1 from each organization) 1. 2. Benefits of Booth Staff Only Passes: · Technical Sessions & Labs — taught by leading Intel and industry engineers who know the things you need to know. · Keynote Addresses — delivered by Intel thought and technical leaders. This is where future directions are unveiled. Hear about them first at IDF. · Technology Showcase — see exhibits and participate in demonstrations of the latest technology innovations from Intel and other industry leaders. Interact with technology innovators in one of the technology community areas. · Panels — hear top industry leaders discuss and debate key technology topics. *Booth demos* - KeyEye has volunteered to do a demonstration of InfiniBand over Cat6 cable at the booth which I thought would be something great to show on the IBTA side. This demonstration highlights the fact that InfiniBand can operate over mainstream cabling solutions such as Cat6 – key technology for data center adoption. - Any other demo suggestions? SPEAK UP NOW. You will need to be responsible for shipping the demo to IDF, setting up the demo, tearing down the demo, and shipping the demo back. 
Of course, it should deliver a message that is in line with what the organizations are pushing… See what the booth looks like here: http://www.eventreg.com/idfsp06/agreement/opps.htm#exh *Signage update* - We will hopefully be able to display the new organization name, as well as IBTA; there will be signage for each organization, with bullet points. I will route this around in the next day. *Booth collateral* - Standard OpenIB one-page document to include new promoters, new technologies we're supporting, information how to join, etc. - A one-page document on IBTA to include who the steering committee and member lists are, the working groups, what each does, and information how to join. *Booth Attire and Conduct (All)* - It is important that if you are doing booth duty, that you wear OpenIB or IBTA logoed shirts to represent the organization properly. - If you don't have any OpenIB or IBTA logo wear, then wear something plain and business casual, or we'll try to get your something. - While doing booth duty, you are there to represent IBTA or OpenIB and not specifically the company or organization that actually pays you a salary. So conversation should respectfully be about OpenIB and IBTA. Let's make this booth a success with your participation! Marta -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From mshefty at ichips.intel.com Mon Feb 27 12:26:37 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 27 Feb 2006 12:26:37 -0800
Subject: [openib-general] Re: [PATCH][RFC] CMA automatic port number assignment
In-Reply-To:
References: <43FF5DD9.6070107@ichips.intel.com>
Message-ID: <4403607D.6060002@ichips.intel.com>

James Lentini wrote:
>>> static void cma_format_hdr(void *hdr, enum rdma_port_space ps,
>>> 			     struct rdma_route *route)
>>> {
>>>@@ -1371,6 +1379,9 @@ static void cma_format_hdr(void *hdr, en
>>> 	src4 = (struct sockaddr_in *) &route->addr.src_addr;
>>> 	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
>>> +	if (!src4->sin_port)
>>>+		src4->sin_port = cma_generate_ephemeral_port();
>>
>>This only sets the port number in the header. It would make more
>>sense to save the port number in the rdma_cm_id's src_addr.
>
>
> I thought I was saving the port number in the rdma_cm_id's src_addr.
> src4 is an alias of the rdma_id_private's rdma_cma_id's rdma_route's
> rdma_addr's src_addr field.
>
> How would you do it differently?

I missed that this was setting this through an alias. In that case, it makes
more sense to move saving this information to another function. Setting this
inside 'cma_format_hdr()' seems like an unintended side effect.

- Sean

From sean.hefty at intel.com Mon Feb 27 12:54:04 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 27 Feb 2006 12:54:04 -0800
Subject: [openib-general] crash with RMPP test
Message-ID:

I hit the following crash on the server side when testing grmpp with the
latest bits. The parameters to the test are: "rmpp=1" "message_size=1000"
"responses=1". With "responses=0", I don't see the issue.

Has anyone else seen this? I haven't run grmpp in a while, so I'm not sure if
this crash is related to the latest check-in or not. I'll look into this more
later this afternoon.
- Sean Feb 27 13:12:50 mshefty-linux1 kernel: grmpp: starting server Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:recv GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x3 (Send) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP type......0x1 (Data) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP flags.....0x3 (Active - First) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0001 Feb 27 13:14:44 mshefty-linux1 kernel: Payload len....0x03fc Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:sent GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x83 (Send response) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP type......0x2 (Ack) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 
kernel: RMPP flags.....0x1 (Active) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0001 Feb 27 13:14:44 mshefty-linux1 kernel: New window.....0x0041 Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:recv GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x3 (Send) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP type......0x1 (Data) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP flags.....0x1 (Active) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0002 Feb 27 13:14:44 mshefty-linux1 kernel: Payload len....0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:recv GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x3 (Send) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 
mshefty-linux1 kernel: RMPP type......0x1 (Data) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP flags.....0x1 (Active) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0003 Feb 27 13:14:44 mshefty-linux1 kernel: Payload len....0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:recv GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x3 (Send) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP type......0x1 (Data) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP flags.....0x1 (Active) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0004 Feb 27 13:14:44 mshefty-linux1 kernel: Payload len....0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: Madeye:recv GMP Feb 27 13:14:44 mshefty-linux1 kernel: MAD version....0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Class..........0x4a (Unknown vendor/application) Feb 27 13:14:44 mshefty-linux1 kernel: Class version..0x1 Feb 27 13:14:44 mshefty-linux1 kernel: Method.........0x3 (Send) Feb 27 13:14:44 mshefty-linux1 kernel: Status.........0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Class specific.0x00 Feb 27 13:14:44 mshefty-linux1 kernel: Trans ID.......0x10000000f000000 Feb 27 13:14:44 mshefty-linux1 kernel: Attr ID........0x00 Feb 27 
13:14:44 mshefty-linux1 kernel: Attr modifier..0x0000 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP version...0x1 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP type......0x1 (Data) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP RRespTime.0x0 Feb 27 13:14:44 mshefty-linux1 kernel: RMPP flags.....0x5 (Active - Last) Feb 27 13:14:44 mshefty-linux1 kernel: RMPP status....0x0 Feb 27 13:14:44 mshefty-linux1 kernel: Seg number.....0x0005 Feb 27 13:14:44 mshefty-linux1 kernel: Payload len....0x008c Feb 27 13:14:44 mshefty-linux1 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000014 Feb 27 13:14:44 mshefty-linux1 kernel: printing eip: Feb 27 13:14:44 mshefty-linux1 kernel: f8db2d89 Feb 27 13:14:44 mshefty-linux1 kernel: *pde = 3c448067 Feb 27 13:14:44 mshefty-linux1 kernel: Oops: 0000 [#1] Feb 27 13:14:44 mshefty-linux1 kernel: SMP Feb 27 13:14:44 mshefty-linux1 kernel: Modules linked in: ib_grmpp ib_sa ib_addr ib_madeye ib_mthca ib_mad ib_core edd evdev joydev st sr_mod ide_cd cdrom nvram usbserial parport_pc lp parport ipv6 thermal processor fan button battery ac af_packet e1000 i2c_i801 i2c_core hw_random uhci_hcd usbcore reiserfs aic7xxx scsi_transport_spi sd_mod scsi_mod Feb 27 13:14:44 mshefty-linux1 kernel: CPU: 0 Feb 27 13:14:44 mshefty-linux1 kernel: EIP: 0060:[pg0+949489033/1069335552] Not tainted VLI Feb 27 13:14:44 mshefty-linux1 kernel: EIP: 0060:[] Not tainted VLI Feb 27 13:14:44 mshefty-linux1 kernel: EFLAGS: 00010092 (2.6.16-rc1) Feb 27 13:14:44 mshefty-linux1 kernel: EIP is at mthca_ah_grh_present+0x1/0xf [ib_mthca] Feb 27 13:14:44 mshefty-linux1 kernel: eax: 00000000 ebx: f6d9796c ecx: 00000002 edx: f60b8440 Feb 27 13:14:44 mshefty-linux1 kernel: esi: f6d9786c edi: dd017100 ebp: f1db9d94 esp: f1db9d74 Feb 27 13:14:44 mshefty-linux1 kernel: ds: 007b es: 007b ss: 0068 Feb 27 13:14:44 mshefty-linux1 kernel: Process ib_mad1 (pid: 12158, threadinfo=f1db8000 task=dff7b570) Feb 27 13:14:44 mshefty-linux1 kernel: Stack: <0>f1db9d94 f8db1702 
00000002 e2710000 00000286 f60b8440 dd017110 f60b8440 Feb 27 13:14:44 mshefty-linux1 kernel: f1db9df8 f8db1bc4 f60b8440 dd017100 dd017110 00000002 00000001 00000000 Feb 27 13:14:44 mshefty-linux1 kernel: 00000002 00000000 00000001 c01522b4 00000000 00000000 00000092 dd017080 Feb 27 13:14:44 mshefty-linux1 kernel: Call Trace: Feb 27 13:14:44 mshefty-linux1 kernel: [show_stack_log_lvl+174/182] show_stack_log_lvl+0xae/0xb6 Feb 27 13:14:44 mshefty-linux1 kernel: [] show_stack_log_lvl+0xae/0xb6 Feb 27 13:14:44 mshefty-linux1 kernel: [show_registers+244/348] show_registers+0xf4/0x15c Feb 27 13:14:44 mshefty-linux1 kernel: [] show_registers+0xf4/0x15c Feb 27 13:14:44 mshefty-linux1 kernel: [die+249/365] die+0xf9/0x16d Feb 27 13:14:44 mshefty-linux1 kernel: [] die+0xf9/0x16d Feb 27 13:14:44 mshefty-linux1 kernel: [do_page_fault+900/1220] do_page_fault+0x384/0x4c4 Feb 27 13:14:44 mshefty-linux1 kernel: [] do_page_fault+0x384/0x4c4 Feb 27 13:14:44 mshefty-linux1 kernel: [error_code+79/96] error_code+0x4f/0x60 Feb 27 13:14:44 mshefty-linux1 kernel: [] error_code+0x4f/0x60 Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949484484/1069335552] mthca_tavor_post_send+0x2f3/0x56b [ib_mthca] Feb 27 13:14:44 mshefty-linux1 kernel: [] mthca_tavor_post_send+0x2f3/0x56b [ib_mthca] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949924455/1069335552] ib_send_mad+0xdb/0x110 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_send_mad+0xdb/0x110 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949937804/1069335552] send_next_seg+0xcb/0xd2 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] send_next_seg+0xcb/0xd2 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949939122/1069335552] ib_send_rmpp_mad+0xa3/0xb5 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_send_rmpp_mad+0xa3/0xb5 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949924790/1069335552] ib_post_send_mad+0x11a/0x1a8 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_post_send_mad+0x11a/0x1a8 [ib_mad] Feb 27 
13:14:44 mshefty-linux1 kernel: [pg0+949597095/1069335552] send_response+0xbb/0xe3 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [] send_response+0xbb/0xe3 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949597306/1069335552] recv_handler+0x3d/0x51 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [] recv_handler+0x3d/0x51 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949927683/1069335552] ib_mad_complete_recv+0xf3/0x121 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_complete_recv+0xf3/0x121 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949928209/1069335552] ib_mad_recv_done_handler+0x1e0/0x215 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_recv_done_handler+0x1e0/0x215 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949929484/1069335552] ib_mad_completion_handler+0x45/0x7a [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_completion_handler+0x45/0x7a [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [run_workqueue+130/195] run_workqueue+0x82/0xc3 Feb 27 13:14:44 mshefty-linux1 kernel: [] run_workqueue+0x82/0xc3 Feb 27 13:14:44 mshefty-linux1 kernel: [worker_thread+248/298] worker_thread+0xf8/0x12a Feb 27 13:14:44 mshefty-linux1 kernel: [] worker_thread+0xf8/0x12a Feb 27 13:14:44 mshefty-linux1 kernel: [kthread+120/160] kthread+0x78/0xa0 Feb 27 13:14:44 mshefty-linux1 kernel: [] kthread+0x78/0xa0 Feb 27 13:14:44 mshefty-linux1 kernel: [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb Feb 27 13:14:44 mshefty-linux1 kernel: [] kernel_thread_helper+0x5/0xb Feb 27 13:14:44 mshefty-linux1 kernel: Code: 0f ac ca 05 e8 43 90 ff ff eb 1b 8b 4a 18 8b 80 34 07 00 00 8b 52 14 e8 71 04 49 c7 eb 08 8b 42 14 e8 8e 10 3a c7 31 c0 5d c3 55 <8b> 40 14 89 e5 5d 0f be 40 05 c1 e8 1f c3 55 89 e5 57 56 53 53 Feb 27 13:14:44 mshefty-linux1 kernel: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 Feb 27 13:14:44 mshefty-linux1 kernel: in_atomic():0, irqs_disabled():1 Feb 27 13:14:44 
mshefty-linux1 kernel: [show_trace+13/15] show_trace+0xd/0xf Feb 27 13:14:44 mshefty-linux1 kernel: [] show_trace+0xd/0xf Feb 27 13:14:44 mshefty-linux1 kernel: [dump_stack+21/23] dump_stack+0x15/0x17 Feb 27 13:14:44 mshefty-linux1 kernel: [] dump_stack+0x15/0x17 Feb 27 13:14:44 mshefty-linux1 kernel: [__might_sleep+143/153] __might_sleep+0x8f/0x99 Feb 27 13:14:44 mshefty-linux1 kernel: [] __might_sleep+0x8f/0x99 Feb 27 13:14:44 mshefty-linux1 kernel: [profile_task_exit+27/71] profile_task_exit+0x1b/0x47 Feb 27 13:14:44 mshefty-linux1 kernel: [] profile_task_exit+0x1b/0x47 Feb 27 13:14:44 mshefty-linux1 kernel: [do_exit+27/853] do_exit+0x1b/0x355 Feb 27 13:14:44 mshefty-linux1 kernel: [] do_exit+0x1b/0x355 Feb 27 13:14:44 mshefty-linux1 kernel: [do_trap+0/150] do_trap+0x0/0x96 Feb 27 13:14:44 mshefty-linux1 kernel: [] do_trap+0x0/0x96 Feb 27 13:14:44 mshefty-linux1 kernel: [do_page_fault+900/1220] do_page_fault+0x384/0x4c4 Feb 27 13:14:44 mshefty-linux1 kernel: [] do_page_fault+0x384/0x4c4 Feb 27 13:14:44 mshefty-linux1 kernel: [error_code+79/96] error_code+0x4f/0x60 Feb 27 13:14:44 mshefty-linux1 kernel: [] error_code+0x4f/0x60 Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949484484/1069335552] mthca_tavor_post_send+0x2f3/0x56b [ib_mthca] Feb 27 13:14:44 mshefty-linux1 kernel: [] mthca_tavor_post_send+0x2f3/0x56b [ib_mthca] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949924455/1069335552] ib_send_mad+0xdb/0x110 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_send_mad+0xdb/0x110 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949937804/1069335552] send_next_seg+0xcb/0xd2 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] send_next_seg+0xcb/0xd2 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949939122/1069335552] ib_send_rmpp_mad+0xa3/0xb5 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_send_rmpp_mad+0xa3/0xb5 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949924790/1069335552] ib_post_send_mad+0x11a/0x1a8 [ib_mad] Feb 27 13:14:44 
mshefty-linux1 kernel: [] ib_post_send_mad+0x11a/0x1a8 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949597095/1069335552] send_response+0xbb/0xe3 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [] send_response+0xbb/0xe3 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949597306/1069335552] recv_handler+0x3d/0x51 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [] recv_handler+0x3d/0x51 [ib_grmpp] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949927683/1069335552] ib_mad_complete_recv+0xf3/0x121 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_complete_recv+0xf3/0x121 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949928209/1069335552] ib_mad_recv_done_handler+0x1e0/0x215 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_recv_done_handler+0x1e0/0x215 [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [pg0+949929484/1069335552] ib_mad_completion_handler+0x45/0x7a [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [] ib_mad_completion_handler+0x45/0x7a [ib_mad] Feb 27 13:14:44 mshefty-linux1 kernel: [run_workqueue+130/195] run_workqueue+0x82/0xc3 Feb 27 13:14:44 mshefty-linux1 kernel: [] run_workqueue+0x82/0xc3 Feb 27 13:14:44 mshefty-linux1 kernel: [worker_thread+248/298] worker_thread+0xf8/0x12a Feb 27 13:14:44 mshefty-linux1 kernel: [] worker_thread+0xf8/0x12a Feb 27 13:14:44 mshefty-linux1 kernel: [kthread+120/160] kthread+0x78/0xa0 Feb 27 13:14:44 mshefty-linux1 kernel: [] kthread+0x78/0xa0 Feb 27 13:14:44 mshefty-linux1 kernel: [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb Feb 27 13:14:44 mshefty-linux1 kernel: [] kernel_thread_helper+0x5/0xb From hch at lst.de Mon Feb 27 13:20:02 2006 From: hch at lst.de (Christoph Hellwig) Date: Mon, 27 Feb 2006 22:20:02 +0100 Subject: [openib-general] RFC: SDP plans In-Reply-To: <20060227173459.GB19855@mellanox.co.il> References: <20060227173459.GB19855@mellanox.co.il> Message-ID: <20060227212002.GA7597@lst.de> On Mon, Feb 27, 2006 at 07:34:59PM +0200, Michael S. 
Tsirkin wrote:
> I started preparing a stable linux SDP implementation,
> with the eye towards mainline inclusion.
>
> The idea is to get to a drastically simple code base and get this admitted in
> mainline, then add enhancements.
>
> The plan (as compared to existing SDP implementation) includes:
> - Use CMA API
> - Reuse generic code from sock.c
>   SO_SNDBUF/SO_RCVBUF should work properly
> - Use sock_lock_t and simple spin_lock_bh for socket locking
> - Use skbuff and standard skbuff queues (in struct sock)
>   for incoming/outgoing messages
> - Implement transport-level queues by simple circular buffer,
>   attach BSDH by s/g
> - Set socket bits to signal the need for control messages
> - Single CQ, perform all CQ polling from interrupt context
> - Code must be sparse-clean, keep network data in __beXX structures
> - Proper use of DMA API
> - Use sysfs for statistics, entry per socket

this one sounds a bit fishy. with lots of sockets you'll eat up far too
much memory I suspect. But let's look at the code once it's there.

else this list sounds like where sdp needs to head for.

From bos at pathscale.com Mon Feb 27 14:06:21 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 27 Feb 2006 14:06:21 -0800
Subject: [openib-general] RFC: SDP plans
In-Reply-To: <20060227193311.GA20064@mellanox.co.il>
References: <20060227173459.GB19855@mellanox.co.il>
 <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com>
 <44034423.8090109@ichips.intel.com>
 <20060227184212.GD19265@mellanox.co.il>
 <440349FD.1050103@ichips.intel.com>
 <20060227193311.GA20064@mellanox.co.il>
Message-ID: <1141077981.30345.3.camel@serpentine.pathscale.com>

On Mon, 2006-02-27 at 21:33 +0200, Michael S. Tsirkin wrote:

> So it's possible to make the protocol family configurable in userspace sdp
> library (libsdp) and then include libsdp in Release 1.0 and it will work with
> whatever SDP code makes it upstream.

The one thing that worries me about this is that it appears that libsdp
silently falls back to using TCP at the moment, at least based on some
internal tests we did with the environment slightly and inadvertently
misconfigured.

If this is the case, then having the protocol family potentially change
out under SDP could mean that a kernel update (or a switch to a kernel
that doesn't have SDP available) would cause apps to silently start
using IPoIB instead of SDP. Which is the sort of behaviour that leads
to rapid hair loss on the part of both users and support people.

If libsdp were to abort by default if SDP wasn't available, then I think
we'd have a fairly strong case that we're not creating a nightmare for
ourselves.

Message-ID: <000001c63bea$50fea6d0$a1cc180a@amr.corp.intel.com>

Bryan wrote,
>We'll have to see how many of those are feasible. Packaging userspace
>for the different distros is easy enough; the big problem is backporting
>kernel support to kernels that will actually work on the distros in
>question, and building binary packages of those.

>I think the only feasible approach will be for people to build binary
>packages of whatever kernels they can and make them available. People
>who test can either use those, or build their own kernels and report the
>kernel versions they are using when they send in test results.

I just committed some new backport patches for 2.6.9 EL kernels and associated
RPMS based on SVN5492, which matches the initial branch for 1.0, for those who
want to do some initial testing of the 1.0 branch before RC1. I built RPMS
for x86_64, ia64, and i686.

> - CPU and chipset platforms

>I assume ia64 needs to be included, too, but I would very much like
>people to let me know what platforms they want to see tested or can test
>themselves.

I have done some limited testing of ia64, uDAPL and IPoIB using Intel MPI to
drive them and so far, the branch looks stable.

>It would be good to have a common set of features and applications that
>vendors could test in a uniform way, so that we have at least a base set
>of somewhat standardised test results. In addition, any further testing
>that people can perform and report on will be most welcome.

We will be testing primarily using Intel MPI on top of uDAPL on x86_64 and
Itanium. I will also do some limited testing of IPoIB and SDP using Intel MPI
to drive them. Once we have a formal RC1, we will likely load it up on our
32-node x86_64 and ia64 clusters and run some real MPI applications using
Intel MPI.

woody

From tzachid at mellanox.co.il Mon Feb 27 14:21:19 2006
From: tzachid at mellanox.co.il (Tzachi Dar)
Date: Tue, 28 Feb 2006 00:21:19 +0200
Subject: [openib-general] Microsoft virtual machine and Infiniband
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BCAE2@mtlexch01.mtl.com>

Hi Fab,

When trying to run Windows 2003 Server on a Microsoft virtual machine we have
found that there is one problem that prevents IPoIB from running.

The problem, as you can guess, is related to the way MAC addresses are
handled. On such a machine, a fake MAC address is created and is later used
for communication (one MAC per guest OS). However, although these packets
return to the correct computer, IPoIB doesn't restore their correct dest MAC
and therefore pinging a remote host is impossible.

In order to solve this problem there is a need to create a mechanism that will
allow the IPoIB driver to correct the MAC addresses of packets based on their
IP addresses.

It seems that the best way to do this is to have a "static" table of IPs and
MAC addresses and to check every IP packet as well as every ARP reply. We have
done such an experiment and it did seem to work.

We are still looking for a way to configure the table of guest OSes and their
IPs and MACs. One way to achieve this is simply having a static table that will
be entered through some file.
Although this is the simplest way, it has an obvious disadvantage (the need to manually configure the machine). A different way is to find some configuration APIs that the remote machine has, while the last possibility is trying to find the information by sniffing for packets (the way that an Ethernet switch does things).

One bug that I have already found is that if a broadcast packet is sent, for example an ARP request, we send the packet as a multicast, and we also receive the packet ourselves, and later we send this packet to NDIS. This is not the correct behavior (assuming we are emulating Ethernet behavior) and we should remove these packets.

In the next week I'll try to create a patch that will allow the virtual machine to work; I just wanted to know your opinion about this issue.

Thanks
Tzachi

From ftillier at silverstorm.com Mon Feb 27 14:48:56 2006
From: ftillier at silverstorm.com (Fabian Tillier)
Date: Mon, 27 Feb 2006 14:48:56 -0800
Subject: [openib-general] Microsoft virtual machine and Infiniband
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BCAE2@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3017BCAE2@mtlexch01.mtl.com>
Message-ID: <79ae2f320602271448l3466298bmabf41489256e907@mail.gmail.com>

Hi Tzachi,

On 2/27/06, Tzachi Dar wrote:
> Hi Fab,
>
> When trying to run windows 2003 server on a Microsoft virtual machine we
> have found out that there is one problem that prevents IPOIB from running.
>
> The problem as you can guess is related to the way MAC addresses are being
> handled. On such a machine, a fake MAC address is being created and it is
> later used for communication (one MAC per guest OS). However although these
> packets return to the correct computer, IPOIB doesn't restore their correct
> dest MAC and therefore pinging to a remote host is impossible.

Is IPoIB running on the guest OS, or on the host OS?
I'm assuming host, and the guest sends packets using its guest MAC. So a packet gets passed to IPoIB using the guest MAC as source. The recipient of such a packet tries to reconstruct the Ethernet header, and ends up with the sender's host MAC, rather than the sender's guest MAC. Am I following this properly?

> In order to solve this problem there is a need to create a mechanism that
> will allow the IPOIB driver to correct the MAC addresses of packets based on
> their IP addresses.

So, the recipient should do an IP lookup on every received IP packet and restore the MAC based on the IP, rather than just based on the LID/GID of the source. This requires adding a mechanism to look up by IP, which currently doesn't exist (do we need to support duplicate IPs?)

Currently the receive flow does something like this:

    resolve endpoints
    discard loopback
    switch packet type
    {
    case IP:
        handle IP packet;
        break;
    case ARP:
        handle ARP packet;
        break;
    default:
        handle generic packet;
        break;
    }

This would have to change to something like this:

    resolve source by LID/GID and discard loopback
    switch packet type
    {
    case IP:
        resolve endpoints by IP;
        handle IP packet;
        break;
    case ARP:
        process ARP, creating IP mappings;
        break;
    default:
        resolve destination from WC;
        handle generic packet;
        break;
    }

> It seems that the best way to do this is to have a "static" table of IP's
> and MAC addresses and to check every IP packet as well as every ARP reply.
> We have done such an experiment and it did seem to work.

Why have a static table? Why not just extend the endpoint lookup mechanisms to support lookup by IP?

> We are still looking for a way to configure the table of guest OS and their
> IPs and MACs. One way to achieve this is simply having a static table that
> will be entered through some file. Although this is the simplest way, it has
> an obvious disadvantage (the need to manually configure the machine). A
> different way is to find some configuration API's that the remote machine
> has, while the last possibility is trying to find the information by
> sniffing for packets (the way that an Ethernet switch does things).

We have to sniff the packets, both outbound and inbound, to do IPoIB encapsulation since we pretend to be a standard 802.3 NIC. Additional snooping shouldn't be a big deal. If it is, we can add a configuration parameter to turn the IP-based MAC resolution on/off.

> One bug that I have already found is that if a broadcast packet is sent for
> example an ARP request, we send the packet as a multicast, and we also
> receive the packet ourselves, and later we send this packet to NDIS. This is
> not the correct behavior (assuming we are emulating Ethernet behavior) and
> we should remove these packets.

Yes, I have a fix for this in my sandbox already. Any packet we receive where we are the sender needs to be discarded. The existing check in the code for loopback packets uses the unformatted Ethernet header, which clearly doesn't work. Thanks for pointing it out, though!

> In the next week I'll try to create a patch that will allow the virtual
> machine to work, I just wanted to know what your opinion about this issue.

Cool, thanks! Hopefully my understanding above is correct. Please let me know if I've missed something.
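To make the lookup-by-IP idea above a bit more concrete, here is a minimal sketch in plain C. It is only an illustration under assumptions, not the IPoIB driver code: the `ip_mac_entry` table and the `learn_mapping()`, `lookup_mac()` and `is_own_packet()` helpers are hypothetical names, and a real implementation would presumably hang the mappings off the existing endpoint structures rather than a fixed array.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN     6
#define MAX_ENTRIES 16

/* Hypothetical mapping from an IPv4 address to the MAC the guest uses. */
struct ip_mac_entry {
    uint32_t ip;              /* IPv4 address, any consistent byte order */
    uint8_t  mac[MAC_LEN];
    int      in_use;
};

static struct ip_mac_entry ip_mac_table[MAX_ENTRIES];

/* Learn or refresh a mapping, e.g. while processing an ARP packet.
 * Returns 0 on success, -1 if the table is full. */
static int learn_mapping(uint32_t ip, const uint8_t mac[MAC_LEN])
{
    int i, free_slot = -1;

    for (i = 0; i < MAX_ENTRIES; i++) {
        if (ip_mac_table[i].in_use && ip_mac_table[i].ip == ip) {
            memcpy(ip_mac_table[i].mac, mac, MAC_LEN);   /* refresh */
            return 0;
        }
        if (!ip_mac_table[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;

    ip_mac_table[free_slot].ip = ip;
    memcpy(ip_mac_table[free_slot].mac, mac, MAC_LEN);
    ip_mac_table[free_slot].in_use = 1;
    return 0;
}

/* Look up the MAC recorded for an IP; returns NULL if none is known. */
static const uint8_t *lookup_mac(uint32_t ip)
{
    int i;

    for (i = 0; i < MAX_ENTRIES; i++)
        if (ip_mac_table[i].in_use && ip_mac_table[i].ip == ip)
            return ip_mac_table[i].mac;
    return NULL;
}

/* Loopback check: non-zero if the packet's source MAC is our own. */
static int is_own_packet(const uint8_t src_mac[MAC_LEN],
                         const uint8_t own_mac[MAC_LEN])
{
    return memcmp(src_mac, own_mac, MAC_LEN) == 0;
}
```

In the modified receive flow, the ARP case would call something like `learn_mapping()`, the IP case would call `lookup_mac()` before rebuilding the Ethernet header, and the loopback test would run on the reconstructed source MAC rather than on the unformatted header.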
Thanks,

- Fab

From halr at voltaire.com Mon Feb 27 14:49:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Feb 2006 17:49:20 -0500
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <4403422F.2000408@ichips.intel.com>
References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com>
Message-ID: <1141080073.4335.17.camel@hal.voltaire.com>

Hi Sean,

On Mon, 2006-02-27 at 13:17, Sean Hefty wrote:
> Michael S. Tsirkin wrote:
> > I don't see a way around this: the dirt is at the spec level.
>
> Correct - I believe this is an architectural issue that needs to be addressed by
> changes to the spec.

Care to offer a proposal to fix this (I presume in a generic way) ?

-- Hal

> - Sean
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Mon Feb 27 14:52:33 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Feb 2006 17:52:33 -0500
Subject: [openib-general] RFC: SDP plans
In-Reply-To: <44034423.8090109@ichips.intel.com>
References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com>
Message-ID: <1141080131.4335.24.camel@hal.voltaire.com>

On Mon, 2006-02-27 at 13:25, Sean Hefty wrote:
> Bill Boas wrote:
> > Having a stable, performant SDP in Rel. 1.0 available from Novell and
> > RedHat is absolutely critical for Wall Street and others.
>
> Then release 1.0 will need to wait until SDP has been updated.
I thought the mantra was whatever was ready for the train not the other way 'round...

-- Hal

> - Sean

From mshefty at ichips.intel.com Mon Feb 27 15:02:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 27 Feb 2006 15:02:39 -0800
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <1141080073.4335.17.camel@hal.voltaire.com>
References: <200602231814.26918.jackm@mellanox.co.il> <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <1141080073.4335.17.camel@hal.voltaire.com>
Message-ID: <4403850F.4010409@ichips.intel.com>

Hal Rosenstock wrote:
> Care to offer a proposal to fix this (I presume in a generic way) ?

Offhand, I can think of two possibilities.

Redefine/Extend RMPPType.
  3: STOP send packet; shall be sent only from Receiver to Sender.
  4: ABORT send packet; shall be sent only from Receiver to Sender.
  5: STOP receive packet; shall be sent only from Sender to Receiver.
  6: ABORT receive packet; shall be sent only from Sender to Receiver.

or

Extend RMPPStatus. Determine which STOP/ABORT messages may be transferred in either direction. Define existing values as applying only from Receiver to Sender. Define new values (in current reserved range) as applying only from Sender to Receiver.

I prefer the first option, but there may be other options that I'm overlooking.
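As a rough illustration of the first option, the proposed values and their direction rule could be written down as below. This is only a sketch: values 3 through 6 follow the proposal above and are not in the spec, and the enum and function names (`rmpp_type`, `rmpp_role`, `rmpp_term_allowed_from`, `rmpp_term_targets_send`) are made up for the example.

```c
#include <stdbool.h>

/* DATA and ACK keep their existing values; 3..6 are the PROPOSED
 * redefinition/extension and are hypothetical. */
enum rmpp_type {
    RMPP_TYPE_DATA       = 1,
    RMPP_TYPE_ACK        = 2,
    RMPP_TYPE_STOP_SEND  = 3,   /* Receiver -> Sender only */
    RMPP_TYPE_ABORT_SEND = 4,   /* Receiver -> Sender only */
    RMPP_TYPE_STOP_RECV  = 5,   /* Sender -> Receiver only */
    RMPP_TYPE_ABORT_RECV = 6    /* Sender -> Receiver only */
};

enum rmpp_role {
    RMPP_ROLE_SENDER,
    RMPP_ROLE_RECEIVER
};

/* True if a node acting in role 'from' may emit the given STOP/ABORT type.
 * Non-termination types are not modeled here and return false. */
static bool rmpp_term_allowed_from(enum rmpp_type type, enum rmpp_role from)
{
    switch (type) {
    case RMPP_TYPE_STOP_SEND:
    case RMPP_TYPE_ABORT_SEND:
        return from == RMPP_ROLE_RECEIVER;
    case RMPP_TYPE_STOP_RECV:
    case RMPP_TYPE_ABORT_RECV:
        return from == RMPP_ROLE_SENDER;
    default:
        return false;
    }
}

/* True if the packet terminates the recipient's *send* context, false if it
 * terminates the recipient's *receive* context. With distinct types the
 * recipient no longer has to guess which of two transactions sharing a TID
 * is meant. Only meaningful for the four termination types. */
static bool rmpp_term_targets_send(enum rmpp_type type)
{
    return type == RMPP_TYPE_STOP_SEND || type == RMPP_TYPE_ABORT_SEND;
}
```

The point of splitting the types is the second helper: the type alone tells the recipient which side of the transfer is being torn down.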
- Sean From arlin.r.davis at intel.com Mon Feb 27 15:11:53 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 27 Feb 2006 15:11:53 -0800 Subject: [openib-general] [PATCH 1.0] uDAPL - QP destroy and HCA close problems fixed Message-ID: James, Here is a small uDAPL patch that should go into 1.0 that fixes some issues that we just found with MPI scale out testing on OpenIB. QP was not being destroyed in some cases and hca_close issues with async work thread. I am still working one other elusive disconnect problem that may require another small patch. Thanks, -arlin Signed-off by: Arlin Davis Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 5489) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -330,6 +330,13 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_RUN) { + dapl_os_unlock(&g_hca_lock); + goto bail; + } + dapl_os_unlock(&g_hca_lock); + /* * Remove hca from async and CQ event processing list * Wakeup work thread to remove from polling list @@ -342,10 +349,12 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ + write(g_ib_pipe[1], "w", sizeof "w"); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, &remain); } +bail: return (DAT_SUCCESS); } Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 5489) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -306,15 +306,6 @@ static int dapli_cm_active_cb(struct dap destroy = conn->destroy; conn->in_callback = conn->destroy; dapl_os_unlock(&conn->lock); - if (destroy) { - dapl_dbg_log(DAPL_DBG_TYPE_CM, - " active_cb: DESTROY conn %p id %d \n", - conn, conn->cm_id ); - if (conn->ep) 
- conn->ep->cm_handle = IB_INVALID_HANDLE; - - dapl_os_free(conn, sizeof(*conn)); - } return(destroy); } @@ -389,12 +380,6 @@ static int dapli_cm_passive_cb(struct da destroy = conn->destroy; conn->in_callback = conn->destroy; dapl_os_unlock(&conn->lock); - if (destroy) { - if (conn->ep) - conn->ep->cm_handle = IB_INVALID_HANDLE; - - dapl_os_free(conn, sizeof(*conn)); - } return(destroy); } @@ -1080,10 +1065,21 @@ void dapli_cma_event_cb(void) ret = dapli_cm_passive_cb(conn,event); else ret = dapli_cm_active_cb(conn,event); - - if (ret) + + /* destroy both qp and cm_id */ + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " cma_cb: DESTROY conn %p" + " cm_id %p qp %p\n", + conn, conn->cm_id, + conn->cm_id->qp); + + if (conn->cm_id->qp) + rdma_destroy_qp(conn->cm_id); + rdma_destroy_id(conn->cm_id); - + dapl_os_free(conn, sizeof(*conn)); + } break; case RDMA_CM_EVENT_CONNECT_RESPONSE: default: @@ -1095,7 +1091,7 @@ void dapli_cma_event_cb(void) } rdma_ack_cm_event(event); } else { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, + dapl_dbg_log(DAPL_DBG_TYPE_CM, " cm_event: ERROR: rdma_get_cm_event() %d %d %s\n", ret, errno, strerror(errno)); } Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 5489) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -295,7 +295,8 @@ dapl_convert_errno( IN int err, IN const if (!err) return DAT_SUCCESS; #if DAPL_DBG - if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + if ((err != EAGAIN) && (err != ETIME) && + (err != ETIMEDOUT) && (err != EINTR)) dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); #endif Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 5489) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; 
- status = poll(&cq_fd, 1, timeout_ms); + /* restart syscall */ + while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 ) + if (errno == EINTR) + continue; /* returned event */ if (status > 0) { @@ -511,13 +514,15 @@ dapls_ib_wait_object_wait(IN ib_wait_obj /* timeout */ } else if (status == 0) status = ETIMEDOUT; + else + status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); return(dapl_convert_errno(status,"cq_wait_object_wait")); - + } #endif -------------- next part -------------- A non-text attachment was scrubbed... Name: udapl_patch_1.0 Type: application/octet-stream Size: 4236 bytes Desc: not available URL: From mshefty at ichips.intel.com Mon Feb 27 15:13:23 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 15:13:23 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <1141080131.4335.24.camel@hal.voltaire.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <1141080131.4335.24.camel@hal.voltaire.com> Message-ID: <44038793.6010104@ichips.intel.com> Hal Rosenstock wrote: >>>Having a stable, performant SDP in Rel. 1.0 available from Novel and >>>RedHat is absolutely critical for Wall Street and others. >> >>Then release 1.0 will need to wait until SPD has been updated. > > > I thought the mantra was whatever was ready for the train not the other > way 'round... I agree. I was simply stating that the kernel component of SDP is not ready. We've already branched for the 1.0 release. Either that branch was done prematurely, or we need to focus on determining what in that branch is not ready. I don't believe that we should be making major updates to the release branch. Components not ready should simply be moved to a later release. 
- Sean

From bos at pathscale.com Mon Feb 27 15:19:55 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 27 Feb 2006 15:19:55 -0800
Subject: [openib-general] RFC: SDP plans
In-Reply-To: <1141080131.4335.24.camel@hal.voltaire.com>
References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <1141080131.4335.24.camel@hal.voltaire.com>
Message-ID: <1141082395.30345.37.camel@serpentine.pathscale.com>

On Mon, 2006-02-27 at 17:52 -0500, Hal Rosenstock wrote:
> > > Having a stable, performant SDP in Rel. 1.0 available from Novell and
> > > RedHat is absolutely critical for Wall Street and others.
> >
> > Then release 1.0 will need to wait until SDP has been updated.
>
> I thought the mantra was whatever was ready for the train not the other
> way 'round...

Well, here are the facts we have before us.

* SDP is in reasonable shape now, but Michael has a lot of work planned for it before he makes an upstream kernel submission.
* That work isn't going to be ready for 2.6.17, unless a miracle happens.
* Given our current tentative schedule, 2.6.17 will be the current kernel.org kernel when we release 1.0.
* Michael has a suggestion for making libsdp work in the face of a potentially unstable kernel ABI, and I have a suggestion for making his suggestion more robust.
* Whatever we do, it's apparently going to be too late for the next round of enterprise distros from the big vendors.

So. We can move the 1.0 release date to "whenever SDP gets into an upstream kernel" without knowing when that might be, or we can do the best we can with what we have now.

My preference is strongly for the latter.

Message-ID: <000001c63bf5$203c5910$0ff9070a@amr.corp.intel.com>

Michael wrote,
>>Quoting Bill Boas :
>> Michael,
>>
>> Having a stable, performant SDP in Rel. 1.0 available from Novell and RedHat is
>> absolutely critical for Wall Street and others.
>>
>> If anyone disagrees with this requirement please speak up!
>>
>> Bill.

>I think I agree with Roland and Sean that it's best to avoid distributing kernel
>level code in Release 1.0.

If we do not distribute kernel code, then what kernel should we test against? 2.6.16? Or 2.6.17? It seems like people need binary kernel RPMs that install easily to allow more people to test the code. At least the people that I deal with do not typically build their own kernels.

From mshefty at ichips.intel.com Mon Feb 27 15:27:28 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 27 Feb 2006 15:27:28 -0800
Subject: [openib-general] RFC: SDP plans
In-Reply-To: <1141082395.30345.37.camel@serpentine.pathscale.com>
References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <1141080131.4335.24.camel@hal.voltaire.com> <1141082395.30345.37.camel@serpentine.pathscale.com>
Message-ID: <44038AE0.9040208@ichips.intel.com>

Bryan O'Sullivan wrote:
> So. We can move the 1.0 release date to "whenever SDP gets into an
> upstream kernel" without knowing when that might be, or we can do the
> best we can with what we have now.
>
> My preference is strongly for the latter.

SDP should not be in a release until it is release quality, and I would base that on an upstream submission. What's wrong with shipping release 1.0 without SDP, then shipping an updated release (1.1) once SDP is ready? SDP doesn't get there faster by delaying the 1.0 release.
- Sean From dledford at redhat.com Mon Feb 27 15:28:31 2006 From: dledford at redhat.com (Doug Ledford) Date: Mon, 27 Feb 2006 18:28:31 -0500 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <1140817557.6119.493.camel@localhost> References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> <1140803852.1158.46.camel@camp4.serpentine.com> <1140817557.6119.493.camel@localhost> Message-ID: <20060227232831.GS5082@redhat.com> On Fri, Feb 24, 2006 at 01:45:57PM -0800, Matt L. Leininger wrote: > On Fri, 2006-02-24 at 09:57 -0800, Bryan O'Sullivan wrote: > > On Fri, 2006-02-24 at 09:52 -0800, Bob Woodruff wrote: > > > Bryan wrote, > > > >Do you want the diags in, or out? They're not packaged in any way, so > > > >I'd vote for "out". > > > > > > Again, I would vote for including everything that is in the trunk > > > unless the maintainer decides it is not ready or it is some > > > code that is now obsolete and should be removed. > > > > I don't have a problem with leaving code in the SVN branch and not > > touching it. My point is more that the management diags don't have an > > RPM spec file, so if someone doesn't write one, they won't get shipped > > in binary form, and hence they won't get tested or used. This applies > > to other components, too. > > Agreed. Is anyone planning on adding a spec file for the tools and > diags? My spec files for both libibverbs and opensm include the related utilities and diags. My one suggestion is that if you bother to create spec files for a 1.0 release, then please don't use /usr/local, use the proper locations for files as though they were something other than 1 off local builds. For example all the scripts in the management tree use /usr/local as their prefix, the configure program doesn't change them, so my rpm has a shell environment file it drops in /etc/profile.d in order to get the scripts to work without having to edit all the paths. 
I'd prefer not to have to have that file in /etc/profile.d for an official 1.0 release ;-) -- Doug Ledford 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From mshefty at ichips.intel.com Mon Feb 27 15:36:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 15:36:56 -0800 Subject: [openib-general] crash with RMPP test In-Reply-To: References: Message-ID: <44038D18.2000706@ichips.intel.com> Sean Hefty wrote: > I hit the following crash on the server side of the testing grmpp with > the latest bits. The parameters to the test are: "rmpp=1" "message_size=1000" > "responses=1". With "responses=0", I don't see the issue. > > Has anyone else seen this? I haven't run grmpp in a while, so I'm not sure > if this crash is related to the latest check-in or not. I'll look into this > more later this afternoon. FYI - this was a bug in grmpp setting the ah incorrectly. A fix has been committed. - Sean From sean.hefty at intel.com Mon Feb 27 15:44:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 15:44:01 -0800 Subject: [openib-general] [PATCH] MAD: rename 'multipacket' to 'rmpp' Message-ID: The following renames 'multipacket' to 'rmpp' for consistency. The most notable name changes are: ib_mad_multipacket_seg --> ib_rmpp_segment ib_mad_get_multipacket_seg() --> ib_get_rmpp_segment() Signed-off-by: Sean Hefty --- Index: include/rdma/ib_mad.h =================================================================== --- include/rdma/ib_mad.h (revision 5455) +++ include/rdma/ib_mad.h (working copy) @@ -141,7 +141,7 @@ struct ib_rmpp_hdr { __be32 paylen_newwin; }; -struct ib_mad_multipacket_seg { +struct ib_rmpp_segment { struct list_head list; u32 num; u16 size; @@ -597,14 +597,14 @@ struct ib_mad_send_buf * ib_create_send_ gfp_t gfp_mask); /** - * *ib_mad_get_multipacket_seg - returns a given RMPP segment. + * *ib_get_rmpp_segment - returns a given RMPP segment. * @send_buf: Previously allocated send data buffer. 
* @seg_num: number of segment to return * - * This routine returns a pointer to a segment of a multipacket RMPP message. + * This routine returns a pointer to a segment of an RMPP message. */ -struct ib_mad_multipacket_seg -*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num); +struct ib_rmpp_segment *ib_get_rmpp_segment(struct ib_mad_send_buf *send_buf, + int seg_num); /** * ib_free_send_mad - Returns data buffers used to send a MAD. Index: core/mad_rmpp.c =================================================================== --- core/mad_rmpp.c (revision 5455) +++ core/mad_rmpp.c (working copy) @@ -535,7 +535,7 @@ start_rmpp(struct ib_mad_agent_private * static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; int timeout; u32 paylen; @@ -549,8 +549,7 @@ static int send_next_seg(struct ib_mad_s mad_send_wr->pad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(paylen); } else { - seg = ib_rmpp_get_multipacket_seg(mad_send_wr, - mad_send_wr->seg_num); + seg = ib_get_segment(mad_send_wr, mad_send_wr->seg_num); if (!seg) { printk(KERN_ERR PFX "send_next_seg: " "could not find segment %d\n", @@ -605,12 +604,12 @@ out: static inline void adjust_last_ack(struct ib_mad_send_wr_private *wr) { - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; if (wr->last_ack < 2) return; else if (!wr->last_ack_seg) - list_for_each_entry(seg, &wr->multipacket_list, list) { + list_for_each_entry(seg, &wr->rmpp_list, list) { if (wr->last_ack == seg->num) { wr->last_ack_seg = seg; break; @@ -902,7 +901,7 @@ int ib_retry_rmpp(struct ib_mad_send_wr_ return IB_RMPP_RESULT_PROCESSED; mad_send_wr->seg_num = mad_send_wr->last_ack + 1; - mad_send_wr->seg_num_seg = mad_send_wr->last_ack_seg; + mad_send_wr->cur_seg = mad_send_wr->last_ack_seg; ret = send_next_seg(mad_send_wr); if (ret) Index: core/user_mad.c 
=================================================================== --- core/user_mad.c (revision 5455) +++ core/user_mad.c (working copy) @@ -194,7 +194,7 @@ static int copy_recv_mad(struct ib_mad_r struct ib_mad_recv_buf *seg_buf; struct ib_rmpp_mad *rmpp_mad; void *data; - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; int size, len, offset; u8 flags; @@ -220,7 +220,7 @@ static int copy_recv_mad(struct ib_mad_r size = len; else size = sizeof(*rmpp_mad) - offset; - seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + seg = kmalloc(sizeof(struct ib_rmpp_segment) + sizeof(struct ib_rmpp_mad) - offset, GFP_KERNEL); if (!seg) @@ -235,7 +235,7 @@ static int copy_recv_mad(struct ib_mad_r static void free_packet(struct ib_umad_packet *packet) { - struct ib_mad_multipacket_seg *seg, *tmp; + struct ib_rmpp_segment *seg, *tmp; list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { list_del(&seg->list); @@ -322,7 +322,7 @@ static ssize_t ib_umad_read(struct file size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; struct ib_umad_packet *packet; ssize_t ret; @@ -411,7 +411,7 @@ static ssize_t ib_umad_write(struct file int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; int s, seg_num; - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -523,7 +523,7 @@ static ssize_t ib_umad_write(struct file if (length > 0) { buf += sizeof (struct ib_user_mad) + sizeof(struct ib_mad); for (seg_num = 2; length > 0; ++seg_num, buf += s, length -= s) { - seg = ib_mad_get_multipacket_seg(packet->msg, seg_num); + seg = ib_get_rmpp_segment(packet->msg, seg_num); BUG_ON(!seg); s = min_t(int, length, seg->size); if (copy_from_user(seg->data, buf, s)) { Index: core/mad.c =================================================================== --- core/mad.c (revision 
5455) +++ core/mad.c (working copy) @@ -779,23 +779,22 @@ static int get_buf_length(int hdr_len, i return hdr_len + data_len + pad; } -static void free_send_multipacket_list(struct ib_mad_send_wr_private * - mad_send_wr) +static void free_send_rmpp_list(struct ib_mad_send_wr_private *mad_send_wr) { - struct ib_mad_multipacket_seg *s, *t; + struct ib_rmpp_segment *s, *t; - list_for_each_entry_safe(s, t, &mad_send_wr->multipacket_list, list) { + list_for_each_entry_safe(s, t, &mad_send_wr->rmpp_list, list) { list_del(&s->list); kfree(s); } } -static inline int alloc_send_rmpp_segs(struct ib_mad_send_wr_private *send_wr, +static inline int alloc_send_rmpp_list(struct ib_mad_send_wr_private *send_wr, int message_size, int hdr_len, int data_len, u8 rmpp_version, gfp_t gfp_mask) { - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; struct ib_rmpp_mad *rmpp_mad = send_wr->send_buf.mad; int seg_size, i = 2; @@ -809,19 +808,19 @@ static inline int alloc_send_rmpp_segs(s message_size -= sizeof(struct ib_mad); seg_size = sizeof(struct ib_mad) - hdr_len; while (message_size > 0) { - seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + seg_size, + seg = kmalloc(sizeof(struct ib_rmpp_segment) + seg_size, gfp_mask); if (!seg) { printk(KERN_ERR "ib_create_send_mad: RMPP mem " "alloc failed for len %zd, gfp %#x\n", - sizeof(struct ib_mad_multipacket_seg) + seg_size, + sizeof(struct ib_rmpp_segment) + seg_size, gfp_mask); - free_send_multipacket_list(send_wr); + free_send_rmpp_list(send_wr); return -ENOMEM; } seg->size = seg_size; seg->num = i++; - list_add_tail(&seg->list, &send_wr->multipacket_list); + list_add_tail(&seg->list, &send_wr->rmpp_list); message_size -= seg_size; } return 0; @@ -854,7 +853,7 @@ struct ib_mad_send_buf * ib_create_send_ return ERR_PTR(-ENOMEM); mad_send_wr = buf + sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_send_wr->multipacket_list); + INIT_LIST_HEAD(&mad_send_wr->rmpp_list); mad_send_wr->send_buf.mad = buf; mad_send_wr->mad_payload = 
buf + hdr_len; @@ -873,10 +872,10 @@ struct ib_mad_send_buf * ib_create_send_ mad_send_wr->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; mad_send_wr->last_ack_seg = NULL; - mad_send_wr->seg_num_seg = NULL; + mad_send_wr->cur_seg = NULL; if (rmpp_active) { - ret = alloc_send_rmpp_segs(mad_send_wr, message_size, hdr_len, + ret = alloc_send_rmpp_list(mad_send_wr, message_size, hdr_len, data_len, mad_agent->rmpp_version, gfp_mask); if (ret) { @@ -891,48 +890,45 @@ struct ib_mad_send_buf * ib_create_send_ } EXPORT_SYMBOL(ib_create_send_mad); -struct ib_mad_multipacket_seg -*ib_rmpp_get_multipacket_seg(struct ib_mad_send_wr_private *wr, int seg_num) +struct ib_rmpp_segment *ib_get_segment(struct ib_mad_send_wr_private *mad_send_wr, + int seg_num) { - struct ib_mad_multipacket_seg *seg; + struct ib_rmpp_segment *seg; if (seg_num == 2) { - wr->seg_num_seg = - container_of(wr->multipacket_list.next, - struct ib_mad_multipacket_seg, list); - return wr->seg_num_seg; + mad_send_wr->cur_seg = container_of(mad_send_wr->rmpp_list.next, + struct ib_rmpp_segment, list); + return mad_send_wr->cur_seg; } /* get first list entry if was not already done */ - if (!wr->seg_num_seg) - wr->seg_num_seg = - container_of(wr->multipacket_list.next, - struct ib_mad_multipacket_seg, list); - - if (wr->seg_num_seg->num == seg_num) - return wr->seg_num_seg; - else if (wr->seg_num_seg->num < seg_num) { - list_for_each_entry(seg, &wr->seg_num_seg->list, list) { + if (!mad_send_wr->cur_seg) + mad_send_wr->cur_seg = container_of(mad_send_wr->rmpp_list.next, + struct ib_rmpp_segment, list); + + if (mad_send_wr->cur_seg->num == seg_num) + return mad_send_wr->cur_seg; + else if (mad_send_wr->cur_seg->num < seg_num) { + list_for_each_entry(seg, &mad_send_wr->cur_seg->list, list) { if (seg->num == seg_num) { - wr->seg_num_seg = seg; - return wr->seg_num_seg; + mad_send_wr->cur_seg = seg; + return mad_send_wr->cur_seg; } } - return NULL; } else { - 
list_for_each_entry_reverse(seg, &wr->seg_num_seg->list, list) { + list_for_each_entry_reverse(seg, &mad_send_wr->cur_seg->list, + list) { if (seg->num == seg_num) { - wr->seg_num_seg = seg; - return wr->seg_num_seg; + mad_send_wr->cur_seg = seg; + return mad_send_wr->cur_seg; } } - return NULL; } return NULL; } -struct ib_mad_multipacket_seg -*ib_mad_get_multipacket_seg(struct ib_mad_send_buf *send_buf, int seg_num) +struct ib_rmpp_segment *ib_get_rmpp_segment(struct ib_mad_send_buf *send_buf, + int seg_num) { struct ib_mad_send_wr_private *wr; @@ -940,9 +936,9 @@ struct ib_mad_multipacket_seg return NULL; wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - return ib_rmpp_get_multipacket_seg(wr, seg_num); + return ib_get_segment(wr, seg_num); } -EXPORT_SYMBOL(ib_mad_get_multipacket_seg); +EXPORT_SYMBOL(ib_get_rmpp_segment); void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { @@ -954,7 +950,7 @@ void ib_free_send_mad(struct ib_mad_send mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - free_send_multipacket_list(mad_send_wr); + free_send_rmpp_list(mad_send_wr); kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 5455) +++ core/mad_priv.h (working copy) @@ -132,9 +132,9 @@ struct ib_mad_send_wr_private { enum ib_wc_status status; /* RMPP control */ - struct list_head multipacket_list; - struct ib_mad_multipacket_seg *last_ack_seg; - struct ib_mad_multipacket_seg *seg_num_seg; + struct list_head rmpp_list; + struct ib_rmpp_segment *last_ack_seg; + struct ib_rmpp_segment *cur_seg; int last_ack; int seg_num; int newwin; @@ -224,7 +224,7 @@ void ib_mark_mad_done(struct ib_mad_send void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr, int timeout_ms); -struct ib_mad_multipacket_seg -*ib_rmpp_get_multipacket_seg(struct 
ib_mad_send_wr_private *wr, int seg_num); +struct ib_rmpp_segment *ib_get_segment(struct ib_mad_send_wr_private *mad_send_wr, + int seg_num); #endif /* __IB_MAD_PRIV_H__ */ From robert.j.woodruff at intel.com Mon Feb 27 15:46:12 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 27 Feb 2006 15:46:12 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <20060227232831.GS5082@redhat.com> Message-ID: <000101c63bf7$fdbb28a0$0ff9070a@amr.corp.intel.com> Doug Ledford wrote, >My spec files for both libibverbs and opensm include the related utilities >and diags. My one suggestion is that if you bother to create spec files for >a 1.0 release, then please don't use /usr/local, use the proper locations >for files as though they were something other than 1 off local builds. For >example all the scripts in the management tree use /usr/local as their >prefix, the configure program doesn't change them, so my rpm has a shell >environment file it drops in /etc/profile.d in order to get the scripts to >work without having to edit all the paths. I'd prefer not to have to have >that file in /etc/profile.d for an official 1.0 release ;-) Also, should the makefiles in SVN target the "proper locations" rather than /usr/local ? Right now, my test all-on-one usermode RPM targets install stuff in /usr/local, which isn't really the proper place but it is the default target for the makefiles in SVN which allows me to easily download and build a newer version of the code from SVN, since that is where all the makefiles want to put stuff. If the release .spec files put stuff into places like /usr/lib or /usr/lib64 and the makefiles from SVN by default put stuff into /usr/local, then if someone tries to get a newer version from SVN and build it on a platform that has the release code (in the proper places) they will end up with a mess and could have a mismatch of components depending on how they set their path.
woody From bos at pathscale.com Mon Feb 27 15:53:35 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 27 Feb 2006 15:53:35 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <20060227232831.GS5082@redhat.com> References: <000401c6396b$159ea440$6aa1070a@amr.corp.intel.com> <1140803852.1158.46.camel@camp4.serpentine.com> <1140817557.6119.493.camel@localhost> <20060227232831.GS5082@redhat.com> Message-ID: <1141084415.30345.41.camel@serpentine.pathscale.com> On Mon, 2006-02-27 at 18:28 -0500, Doug Ledford wrote: > My spec files for both libibverbs and opensm include the related utilities > and diags. My one suggestion is that if you bother to create spec files for > a 1.0 release, then please don't use /usr/local, use the proper locations > for files as though they were something other than 1 off local builds. Absolutely. > For > example all the scripts in the management tree use /usr/local as their > prefix, the configure program doesn't change them, so my rpm has a shell > environment file it drops in /etc/profile.d in order to get the scripts to > work without having to edit all the paths. Yes, that's ugly. References: <000001c63bf5$203c5910$0ff9070a@amr.corp.intel.com> Message-ID: <1141084497.30345.44.camel@serpentine.pathscale.com> On Mon, 2006-02-27 at 15:25 -0800, Bob Woodruff wrote: > If we do not distribute kernel code, then what kernel should we test against > ? > 2.6.16 ? or 2.6.17 ? I'm planning to build at least some kernel RPMs, most likely for FC4 and SUSE10. I'll have to rely on other people to provide packages for the kernels and distros they're most interested in. > It seems like people need binary kernel RPMs that install easily to > allow more people to test the code. Yes, and to be sure we're testing the same bits. 
References: <000001c63bf5$203c5910$0ff9070a@amr.corp.intel.com> <1141084497.30345.44.camel@serpentine.pathscale.com> Message-ID: <20060228000013.GT5082@redhat.com> On Mon, Feb 27, 2006 at 03:54:57PM -0800, Bryan O'Sullivan wrote: > On Mon, 2006-02-27 at 15:25 -0800, Bob Woodruff wrote: > > > If we do not distribute kernel code, then what kernel should we test against > > ? > > 2.6.16 ? or 2.6.17 ? > > I'm planning to build at least some kernel RPMs, most likely for FC4 and > SUSE10. I'll have to rely on other people to provide packages for the > kernels and distros they're most interested in. I'll be building RHEL4 RPMs. -- Doug Ledford 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From robert.j.woodruff at intel.com Mon Feb 27 16:02:01 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 27 Feb 2006 16:02:01 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <1141084497.30345.44.camel@serpentine.pathscale.com> Message-ID: <000201c63bfa$33fa6af0$0ff9070a@amr.corp.intel.com> Bryan wrote, >I'm planning to build at least some kernel RPMs, most likely for FC4 and >SUSE10. I'll have to rely on other people to provide packages for the >kernels and distros they're most interested in. Should we create an RPMs directory in the 1.0 branch where people can start to put the RPMs for the various kernels and usermode components ? When should we do a build for RC1 ? 
woody From dledford at redhat.com Mon Feb 27 16:06:53 2006 From: dledford at redhat.com (Doug Ledford) Date: Mon, 27 Feb 2006 19:06:53 -0500 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <000101c63bf7$fdbb28a0$0ff9070a@amr.corp.intel.com> References: <20060227232831.GS5082@redhat.com> <000101c63bf7$fdbb28a0$0ff9070a@amr.corp.intel.com> Message-ID: <20060228000653.GV5082@redhat.com> On Mon, Feb 27, 2006 at 03:46:12PM -0800, Bob Woodruff wrote: > Doug Ledford wrote, > > >My spec files for both libibverbs and opensm include the related utilities > >and diags. My one suggestion is that if you bother to create spec files > for > >a 1.0 release, then please don't use /usr/local, use the proper locations > >for files as though they were something other than 1 off local builds. For > >example all the scripts in the management tree use /usr/local as their > >prefix, the configure program doesn't change them, so my rpm has a shell > >environment file it drops in /etc/profile.d in order to get the scripts to > >work without having to edit all the paths. I'd prefer not to have to have > >that file in /etc/profile.d for an official 1.0 release ;-) > > Also, should the makefiles in SVN target the "proper locations" rather than > /usr/local ? Right now, my test all-on-one usermode RPM targets > install stuff in /usr/local, which isn't really the proper place but it > is the default target for the makefiles in SVN which allows me to easily > download > and build a newer version of the code from SVN, since that is where all the > makefiles want to put stuff. > > If the release .spec files put stuff into places like /usr/lib or /usr/lib64 > and the makefiles from SVN by default put stuff into /usr/local, then if > someone > tries to get a newer version from SVN and build it on a platform > that has the release code (in the proper places) they will end up with a > mess and could have a mismatch of components depending on how they > set their path. 
My opinion on this is that once you make an official release with the files in non /usr/local locations, the development tree should be updated to use those locations as well to avoid the exact problem you cited. In general, my thought is that once you go from purely a developer quality product to a release product with ongoing development, then it's time to start using official locations instead of /usr/local. My $.02 -- Doug Ledford 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 From robert.j.woodruff at intel.com Mon Feb 27 16:09:24 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 27 Feb 2006 16:09:24 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <20060228000013.GT5082@redhat.com> Message-ID: <000301c63bfb$3ba07780$0ff9070a@amr.corp.intel.com> Doug wrote, >I'll be building RHEL4 RPMs. Maybe I can build RPMs based on the latest kernel.org kernels, at least for x86_64, IPF, and i386. woody From robert.j.woodruff at intel.com Mon Feb 27 16:11:03 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 27 Feb 2006 16:11:03 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <20060228000653.GV5082@redhat.com> Message-ID: <000401c63bfb$76a85050$0ff9070a@amr.corp.intel.com> Doug wrote, >My opinion on this is that once you make an official release with the files >in non /usr/local locations, the development tree should be updated to use >those locations as well to avoid the exact problem you cited. In general, >my thought is that once you go from purely a developer quality product to a >release product with ongoing development, then it's time to start using >official locations instead of /usr/local. My $.02 Yep, makes sense to me. 
woody From sean.hefty at intel.com Mon Feb 27 16:36:07 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 27 Feb 2006 16:36:07 -0800 Subject: [openib-general] [PATCH] uMAD: fix copying MAD data when reporting a send failure Message-ID: Fix a bug where the size of the data copied to userspace is larger than the reported size. Re-use the failed send packet when reporting the failure, versus copying the send information to a new packet. Signed-off-by: Sean Hefty --- NOTE: I don't have a way to test that this fix actually works. Index: user_mad.c =================================================================== --- user_mad.c (revision 5514) +++ user_mad.c (working copy) @@ -248,28 +248,17 @@ static void send_handler(struct ib_mad_a struct ib_mad_send_wc *send_wc) { struct ib_umad_file *file = agent->context; - struct ib_umad_packet *timeout; struct ib_umad_packet *packet = send_wc->send_buf->context[0]; ib_destroy_ah(packet->msg->ah); ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kzalloc(sizeof *timeout + sizeof(struct ib_mad), - GFP_KERNEL); - if (!timeout) - goto out; - INIT_LIST_HEAD(&timeout->seg_list); - timeout->length = IB_MGMT_MAD_HDR; - timeout->mad.hdr.id = packet->mad.hdr.id; - timeout->mad.hdr.status = ETIMEDOUT; - memcpy(timeout->mad.data, packet->mad.data, - sizeof (struct ib_mad_hdr)); - - if (queue_packet(file, agent, timeout)) - free_packet(timeout); + packet->length = IB_MGMT_MAD_HDR; + packet->mad.hdr.status = ETIMEDOUT; + if (!queue_packet(file, agent, packet)) + return; } -out: kfree(packet); } @@ -324,7 +313,7 @@ static ssize_t ib_umad_read(struct file struct ib_umad_file *file = filp->private_data; struct ib_rmpp_segment *seg; struct ib_umad_packet *packet; - ssize_t ret; + ssize_t ret, size; if (count < sizeof (struct ib_user_mad) + sizeof (struct ib_mad)) return -EINVAL; @@ -349,8 +338,9 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); + size = min_t(int, 
sizeof (struct ib_mad), packet->length); if (copy_to_user(buf, &packet->mad, - sizeof(struct ib_user_mad) + sizeof(struct ib_mad))) { + sizeof(struct ib_user_mad) + size)) { ret = -EFAULT; goto err; } From mst at mellanox.co.il Mon Feb 27 16:38:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 02:38:33 +0200 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <20060227212002.GA7597@lst.de> References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> Message-ID: <20060228003833.GD20064@mellanox.co.il> Quoting r. Christoph Hellwig : > > - Use sysfs for statistics, entry per socket > > this one sounds a bit fishy. with lots of sockets you'll eat up far > too much memory I suspect. But let's look at the code once it's there. Hmm. Should I stick to /proc then? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From rdreier at cisco.com Mon Feb 27 16:40:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 16:40:54 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <20060228003833.GD20064@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 28 Feb 2006 02:38:33 +0200") References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> Message-ID: > > > - Use sysfs for statistics, entry per socket > > this one sounds a bit fishy. with lots of sockets you'll eat up far > > too much memory I suspect. But let's look at the code once it's there. > Hmm. Should I stick to /proc then? No, nothing like per-socket statistics in /proc is going to be merged upstream. I think the real question is whether per-socket statistics are useful. If they are important, and sysfs doesn't work, then another possibility would be a netlink-based way of retrieving them. - R. 
From vuhuong at mellanox.com Mon Feb 27 16:42:07 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 27 Feb 2006 16:42:07 -0800 Subject: [openib-general][patch review] srp: fmr implementation, Message-ID: <44039C5F.9070503@mellanox.com> Another attempt to implement fmr for srp + moving dev_list, pd, dma mr, and fmr resource to srp_device per ib_device + implementing fmr - try to build a single fmr per scsi_cmd if fail then falling back to dma mr Signed-off-by: Vu Pham -------------- next part -------------- A non-text attachment was scrubbed... Name: srp-fmr.patch Type: text/x-patch Size: 10874 bytes Desc: not available URL: From robert.j.woodruff at intel.com Mon Feb 27 17:56:02 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 27 Feb 2006 17:56:02 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <25AE7F432672D511B8DC00B0D0DF11DA06666145@MTIEX01> Message-ID: <000501c63c0a$2139e660$0ff9070a@amr.corp.intel.com> Sujal wrote, >Will the release address following test/support scenarios pertaining to the >specific Open IB components being discussed? >- OS distro and kernel versions to be tested with - RHAs, SLES, SUSE Pro, Fedora >etc..., kernel 2.6.13, 14.... etc We will probably help test RHEL4 and perhaps the latest kernel.org kernel with uDAPL and Intel MPI. What are your plans for testing various kernels and such ? >- CPU and chipset platforms We will test on Intel chipsets and platforms ? Lindenhurst, Tiger, and some older i686 Xeon platforms. What platforms can you help with testing ? >- Variations on each HCA hardware. For example - with Mellanox, we have DDR, >SDR, Mem-free etc. We will test Mellanox SDR and DDR cards on x86_64, SDR on IPF, SDR on Xeon ia32 platforms. >- Testing with various switches available We will test Mellanox SDR switches. 
>- Testing for performance/features with ISV / user level apps - Fluent, >Oracle, DB2, LSDyna, MPI etc We will do some testing of real MPI applications using Intel MPI on uDAPL. >- Testing with available storage targets - both gateways and native, SRP and >iSER? I will do some limited testing of iSER. Not sure I have time to do SRP also, how about you? Can you help with testing SRP? The more people that help, the better the release will be. woody From ianjiang.ict at gmail.com Mon Feb 27 18:38:22 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 28 Feb 2006 10:38:22 +0800 Subject: [openib-general] [CM]question about creating a Servcie Message-ID: <7b2fa1820602271838u32abe4b9g7732fa19d14705cc@mail.gmail.com> Hi all, I am trying to use the CM mechanism in kernel space and came across a problem while creating a service: [IB_CM][ib_cm_service_create][drivers/infiniband/ib_cm/cm_service_table.c:198]Conflict between 000000000000b0e0/ffffffffffffffff and 0000000000000000/ffffffffffff0000 I am not clear on the rules for using the Service ID and the Service Mask. Could anybody give an explanation? 
Thanks very much. Here is a piece of my code and the relevant code in IBGD-1.8.0:

================================================================
#define TS_KVAPITEST_MSG_SERVICE_ID_VALUE (0xb0e0ULL)
#define TS_KVAPITEST_MSG_SERVICE_ID_MASK (0xFFFFFFFFFFFF0000ULL)

res = ib_cm_listen(TS_KVAPITEST_MSG_SERVICE_ID_VALUE,
		   TS_KVAPITEST_MSG_SERVICE_ID_MASK,
		   params_p->ib_res.call_back_func,
		   params_p->ib_res.arg,
		   &params_p->ib_res.listen_hndl);
================================================================
int ib_cm_service_create(tTS_IB_SERVICE_ID service_id,
			 tTS_IB_SERVICE_ID service_mask,
			 struct ib_cm_service **service)
{
	struct ib_cm_tree_node *node;
	struct ib_cm_tree_node *new_node = NULL;
	struct ib_cm_tree_node *new_parent = NULL;
	u64 bit, mask;
	int child_num;
	int ret;
	TS_WINDOWS_SPINLOCK_FLAGS

	*service = NULL;

	new_node = kmem_cache_alloc(node_cache, GFP_KERNEL);
	new_parent = kmem_cache_alloc(node_cache, GFP_KERNEL);
	*service = kmem_cache_alloc(service_cache, GFP_KERNEL);
	if (!new_node || !new_parent || !*service) {
		ret = -ENOMEM;
		goto out_free;
	}

	new_node->id = service_id;
	new_node->mask = service_mask;
	new_node->bit = 0;
	new_node->ptr.service = *service;

	spin_lock(&tree_lock);

	node = ib_cm_tree_search(service_id, service_mask);
	if (node) {
		if ((service_mask & node->id) == (node->mask & service_id)) {
			TS_REPORT_WARN(MOD_IB_CM,
				       "Conflict between %016" TS_U64_FMT "x/%016" TS_U64_FMT "x "
				       "and %016" TS_U64_FMT "x/%016" TS_U64_FMT "x",
				       node->id, node->mask, service_id, service_mask);
			ret = -EADDRINUSE;
			goto out;
		}

		/* find the first bit where we're different -- if there
		   isn't one, then we conflict with the previous service */
		for (mask = 0x0ULL, bit = 0x8000000000000000ULL;
		     bit;
		     mask |= bit, bit >>= 1) {
			if ((bit & service_id) != (bit & node->id))
				break;
		}

		if (!bit || mask >= node->mask) {
			TS_REPORT_WARN(MOD_IB_CM,
				       "Couldn't find a difference: %016" TS_U64_FMT "x/%016" TS_U64_FMT "x "
				       "and %016" TS_U64_FMT "x/%016" TS_U64_FMT "x",
				       node->id, node->mask, service_id, service_mask);
			ret = -EINVAL;
			goto out;
		}

		new_parent->id = service_id & mask;
		new_parent->mask = mask;
		new_parent->bit = bit;
		new_parent->parent = node->parent;
		new_parent->child_num = node->child_num;
		if (node->parent) {
			node->parent->ptr.child[node->child_num] = new_parent;
		} else {
			service_tree = new_parent;
		}

		child_num = !!(bit & service_id);

		new_parent->ptr.child[ child_num] = new_node;
		new_node->parent = new_parent;
		new_node->child_num = child_num;

		new_parent->ptr.child[!child_num] = node;
		node->parent = new_parent;
		node->child_num = !child_num;
	} else {
		if (service_tree) {
			TS_REPORT_WARN(MOD_IB_CM, "No parent found but tree not empty!");
			ret = -EINVAL;
			goto out;
		}

		/* Don't need the new parent node for the first service we add */
		kmem_cache_free(node_cache, new_parent);

		service_tree = new_node;
		new_node->parent = NULL;
	}

	(*service)->node = new_node;
	(*service)->freeing = 0;
	init_MUTEX_LOCKED(&(*service)->mutex);
	atomic_set(&(*service)->waiters, 0);

	spin_unlock(&tree_lock);
	return 0;

out:
	spin_unlock(&tree_lock);

out_free:
	if (new_node)
		kmem_cache_free(node_cache, new_node);
	if (new_parent)
		kmem_cache_free(node_cache, new_parent);
	if (*service)
		kmem_cache_free(service_cache, *service);

	return ret;
}

-- 
Ian Jiang
ianjiang.ict at gmail.com
Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences
-------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Mon Feb 27 19:25:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 19:25:42 -0800 Subject: [openib-general] [CM]question about creating a Servcie In-Reply-To: <7b2fa1820602271838u32abe4b9g7732fa19d14705cc@mail.gmail.com> (Ian Jiang's message of "Tue, 28 Feb 2006 10:38:22 +0800") References: <7b2fa1820602271838u32abe4b9g7732fa19d14705cc@mail.gmail.com> Message-ID: Ian> Hi all, I am trying to use the CM mechanism in the kernal Ian> space and come across a problem while creating a service: Ian> [IB_CM][ib_cm_service_create][drivers/infiniband/ib_cm/cm_service_table.c:198]Conflict Ian> between 000000000000b0e0/ffffffffffffffff and Ian> 0000000000000000/ffffffffffff0000 You've chosen a service ID in the range used by SDP (SDP claims 0000...ffff, and you picked b0e0). Try something like 1b0e0 instead. - R. From mlleinin at hpcn.ca.sandia.gov Mon Feb 27 19:58:58 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 27 Feb 2006 19:58:58 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: <44038AE0.9040208@ichips.intel.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <1141080131.4335.24.camel@hal.voltaire.com> <1141082395.30345.37.camel@serpentine.pathscale.com> <44038AE0.9040208@ichips.intel.com> Message-ID: <1141099138.6119.593.camel@localhost> On Mon, 2006-02-27 at 15:27 -0800, Sean Hefty wrote: > Bryan O'Sullivan wrote: > > So. We can move the 1.0 release date to "whenever SDP gets into an > > upstream kernel" without knowing when that might be, or we can do the > > best we can with what we have now. > > > > My preference is strongly for the latter. > > SDP should not be in a release until it is release quality, and I would based > that on an upstream submission. > > What's wrong with shipping release 1.0 without SDP, then shipping an updated > release (1.1) once SDP is ready? 
SDP doesn't get there faster by delaying the > 1.0 release. Sounds reasonable to me. Put whatever components are ready into the 1.0 release. Other components can come in later. Don't try and solve every problem now, just work towards a solid baseline for code releases. - Matt From bos at pathscale.com Mon Feb 27 20:13:21 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Mon, 27 Feb 2006 20:13:21 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> Message-ID: <1141100001.6310.14.camel@camp4.serpentine.com> On Mon, 2006-02-27 at 16:40 -0800, Roland Dreier wrote: > If they are important, and sysfs doesn't work, then another > possibility would be a netlink-based way of retrieving them. If you have to use netlink, consider using the connector interface (new since 2.6.15) instead of vanilla netlink. The normal netlink interface is just horrible. (Michael S. Tsirkin's message of "Mon, 27 Feb 2006 10:59:24 +0200") References: <20060222104037.GB21077@mellanox.co.il> <20060226235750.GA22412@mellanox.co.il> <20060227085924.GN19855@mellanox.co.il> Message-ID: Michael> I guess there's a misunderstanding here. Its pretty Michael> simple: ipoib_mcast_send tests mcast->ah twice under Michael> priv->lock. ipoib_mcast_join_finish modifies the Michael> mcast->ah without taking a lock. No *other* place Michael> modifies the mcast->ah. Michael> As a solution, take priv->lock around assignment to Michael> mcast->ah thus making sure ipoib_mcast_send is not in Michael> flight. Ah, I got it now. But since ipoib_mcast_join_finish is always called with interrupts enabled, we can use the following slightly simpler version. I applied this and queued it for 2.6.17 (I don't think this is critical enough for 2.6.16 at this stage). - R. 
--- infiniband/ulp/ipoib/ipoib_multicast.c (revision 5514) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -213,6 +213,7 @@ static int ipoib_mcast_join_finish(struc { struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah; int ret; mcast->mcmember = *mcmember; @@ -269,8 +270,8 @@ static int ipoib_mcast_join_finish(struc av.static_rate, priv->local_rate, ib_sa_rate_enum_to_int(mcast->mcmember.rate)); - mcast->ah = ipoib_create_ah(dev, priv->pd, &av); - if (!mcast->ah) { + ah = ipoib_create_ah(dev, priv->pd, &av); + if (!ah) { ipoib_warn(priv, "ib_address_create failed\n"); } else { ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT @@ -280,6 +281,10 @@ static int ipoib_mcast_join_finish(struc be16_to_cpu(mcast->mcmember.mlid), mcast->mcmember.sl); } + + spin_lock_irq(&priv->lock); + mcast->ah = ah; + spin_unlock_irq(&priv->lock); } /* actually send any queued packets */ From rdreier at cisco.com Mon Feb 27 20:54:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 20:54:45 -0800 Subject: [openib-general] Re: [PATCH] uMAD: fix copying MAD data when reporting a send failure In-Reply-To: (Sean Hefty's message of "Mon, 27 Feb 2006 16:36:07 -0800") References: Message-ID: Sean> Fix a bug where the size of the data copied to userspace is Sean> larger than the reported size. Re-use the failed send Sean> packet when reporting the failure, versus copying the send Sean> information to a new packet. Looks fine to me. I checked it into svn. It seems like the upstream kernel doesn't have this bug, right? - R. 
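The uMAD fix discussed in this thread comes down to clamping the number of bytes copied out to the length recorded on the packet, instead of always copying a full-sized MAD. A minimal userspace sketch of that idea follows. It is not the user_mad.c code: struct packet, HDR_SIZE, MAD_SIZE and copy_packet() are hypothetical stand-ins chosen for illustration, and plain memcpy() takes the place of copy_to_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes standing in for the ib_user_mad header and struct ib_mad. */
enum { HDR_SIZE = 24, MAD_SIZE = 256 };

/* Hypothetical packet layout; only `length` bytes of `data` are valid. */
struct packet {
	size_t length;
	unsigned char hdr[HDR_SIZE];
	unsigned char data[MAD_SIZE];
};

/*
 * Copy the header plus at most `length` payload bytes into dst.
 * Returns the number of bytes copied, or 0 if dst is too small.
 */
static size_t copy_packet(unsigned char *dst, size_t dst_len,
			  const struct packet *p)
{
	size_t payload;
	size_t total;

	assert(dst != NULL && p != NULL);

	/* Clamp to the buffer size, mirroring the min_t() in the patch. */
	payload = p->length < MAD_SIZE ? p->length : MAD_SIZE;
	total = HDR_SIZE + payload;

	if (dst_len < total)
		return 0;

	memcpy(dst, p->hdr, HDR_SIZE);
	memcpy(dst + HDR_SIZE, p->data, payload);
	return total;
}
```

A caller would pass a buffer of at least HDR_SIZE + MAD_SIZE bytes and use the return value as the number of valid bytes; anything past that offset in the buffer is left untouched.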
From rdreier at cisco.com Mon Feb 27 20:56:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 20:56:26 -0800 Subject: [openib-general] [patch] tvflash configure checks for libpci In-Reply-To: <20060227154027.GB7849@aon.at> (Bernhard Fischer's message of "Mon, 27 Feb 2006 16:40:27 +0100") References: <20060227154027.GB7849@aon.at> Message-ID: I haven't really been maintaining tvflash and I suggest mstflint for FW burning now, but it can't hurt to fix things like this, so I applied the patch. Thanks, Roland From rdreier at cisco.com Mon Feb 27 20:58:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 20:58:58 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fence support In-Reply-To: <20060227141226.GV19855@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Feb 2006 16:12:26 +0200") References: <20060227141226.GV19855@mellanox.co.il> Message-ID: Thanks, applied. From rdreier at cisco.com Mon Feb 27 21:02:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Feb 2006 21:02:10 -0800 Subject: [openib-general] Re: [PATCH] mthca: fence support In-Reply-To: <20060227141048.GU19855@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Feb 2006 16:10:48 +0200") References: <20060227141048.GU19855@mellanox.co.il> Message-ID: Thanks, applied and queued for 2.6.17. From sean.hefty at intel.com Mon Feb 27 21:21:38 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Mon, 27 Feb 2006 21:21:38 -0800 Subject: [openib-general] RE: [PATCH] uMAD: fix copying MAD data when reporting a send failure Message-ID: > Sean> Fix a bug where the size of the data copied to userspace is > Sean> larger than the reported size. Re-use the failed send > Sean> packet when reporting the failure, versus copying the send > Sean> information to a new packet. > >Looks fine to me. I checked it into svn. > >It seems like the upstream kernel doesn't have this bug, right? 
Correct - I checked what was upstream, and didn't see that it had this issue. - Sean From mst at mellanox.co.il Mon Feb 27 22:37:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 08:37:50 +0200 Subject: [openib-general] Re: ipoib_multicast_ah.patch In-Reply-To: References: <20060222104037.GB21077@mellanox.co.il> <20060226235750.GA22412@mellanox.co.il> <20060227085924.GN19855@mellanox.co.il> Message-ID: <20060228063750.GC17421@mellanox.co.il> Quoting r. Roland Dreier : > I applied this and queued it for 2.6.17 (I don't think this is > critical enough for 2.6.16 at this stage). Right. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From devesh28 at gmail.com Mon Feb 27 22:44:16 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 28 Feb 2006 12:14:16 +0530 Subject: [openib-general] LID assignment policy of opensm Message-ID: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> Hi list, Could anybody brief me on the LID assignment policy used by the opensm subnet manager? Can a user specify fixed LID mappings using a file? -------------- next part -------------- An HTML attachment was scrubbed... URL: From bboas at systemfabricworks.com Mon Feb 27 22:49:05 2006 From: bboas at systemfabricworks.com (Bill Boas) Date: Mon, 27 Feb 2006 22:49:05 -0800 Subject: [openib-general] Need for ONE OpenIB Release process that all members can agree to and that follows OpenIB Bylaws Message-ID: There appear to be 2 groups within OpenIB thinking about different approaches to preparing the code for Release 1.0. One group is thinking about downstreaming it to RedHat and Novell, another group seems to be thinking about separate releases from some IB suppliers than others. Let's remind ourselves of the purposes for which OpenIB was created and what all of the member companies have just re-affirmed in the Board meeting last Friday (by approving the re-worked By-laws). 
The principles are, I believe: (if there are misstatements below, let's discuss openly)

1) OpenIB develops open source code creating a software stack. OpenIB (now OpenFabrics Alliance) is a corporation with Bylaws that all members should obey if they want the corporation to continue to function. It will only survive if, in general, all members' self-interests are served simultaneously with each member's own self-interest.

2) OpenIB members by a 2/3rds vote of the members have to approve the content of that stack through the Proposal process described in section 12 of the Bylaws. It is not up to a single member or group of members to decide on their own what is or is not in the OpenIB stack. This is deliberate to prevent one or more members gaining competitive advantage through the OpenIB stack over other members.

3) OpenIB downstreams kernel code to kernel.org

4) OpenIB code is distributed to end customers (like Wall St., labs, etc) and to mid tier customers of OpenIB (Oracle, IBM, Sun, Dell, LNXI etc.) via Linux distributions such as RedHat and Novell.

5) End customers told the IB companies in February 2004 and in December 2005 at Credit Suisse (HSIR meeting) that they wanted ONE OpenIB stack that runs on every IB vendor's hardware, that interoperates with all other IB vendors' h/w and s/w, is used by all mid-tier suppliers and that it comes with their Linux distribution.

I realize that so far in OpenIB's evolution we have not worked out the issue of how to support end-customers while following these principles for the release process. But that, I suggest, is not a valid reason for breaking these principles. We should be able to deal with "Release" as one process and "Support" as another process - though of course there will be linkage between them but they are not the same process. The way to do one is not necessarily the way to do the other. 
This email is an appeal to the two groups to work together, not to work separately, and to work on solving these issues for the membership as a whole, not just their own company, or a select group. Please bring to the Board a proposal that serves all the membership. Here's what one group seems to be thinking (edited to remove "I"): "Here is a first cut at the set of components (protocols, drivers, userspace bits) that we think we should be supporting in 1.0. Please look over it and let us know if we are missing anything. HCA support (both kernel driver and userspace verbs components): * ehca * ipath * mthca IB protocols: * IPoIB * RC * SDP * SRQ * UC * UD Userland software: * libibverbs * libsdp * opensm As far as we can tell, most of the rest of OpenIB userland (libibcm, libibat, libibmad, etc) is logically part of OpenSM, can be treated as such (I think Doug is already doing this with his Red Hat spec files) and is unlikely to be used by other applications. Am I way off? Components that we don't know what to do about, and will likely want to drop unless someone can vouch for them: * iSER * SRP * uDAPL" Here's what the other group suggested: "Openib Commercial Grade Release 1.0 release criteria 1) CPU Architectures: a) x86_32 (Xeon) b) x84_64 (Nocona, Opteron) c) ia64 d) PPC64 (Power5, Power6) - Mellanox does not support these systems 2) Linux distributors and kernels a) RH: AS EL4 up3; Fedora C4 last update , and maybe FC5 b) SuSE: SuSE 10 last update (open - SLES10 beta) c) kernel.org: the latest that is available when generating rc1. In 1.0 it will probably be 2.6.16 (might be 2.6.17). 3) Packaging and installation a) The openib release will be packages in one tarball for both kernel and user-level. b) One install script will support full installation. The install will support typical and custom components I will send a different document with install definition to be reviewed and agreed between all. 
4) HCA and Switch Support: a) HCAs: InfiniHost, InfiniHost III Ex (both modes: with memory and MemFree), InfiniHost III Lx b) Switches: Need to support all vendors' production switches - each vendor should send the list. 5) Switch Management Interoperability testing a) Follow the CIWG-OpenIB HCA-OEM Switch Interop Test Plan 6) Feature set per ULP: a) Will be defined later with each ULP maintainer. 7) Minimum cluster size to be tested a) Need at least 128 nodes cluster, bigger is better. 8) Scalability requirements a) SM: i) Bringup a subnet with 1,000 nodes in 2 minutes ii) SM should not be a bottle neck in any application running (IPoIB) b) MPI: i) MPI runner - should be able to launch thousands of processes (say 50,000) in a bounded time manner. ii) Memory consumption - should be able to run many processes on the same node (for now, 8 processes is the upper limit with the Opteron machines), in a many node (thousands of nodes) installation. iii) Sending HUGE messages in collectives - MPI should not fail for limited physical memory. 9) Performance requirements: First we need to agree on the performance benchmark for each ULP: a) Basic verbs - performance tests in openib (send, RDMA read/write latency & BW) b) IPoIB - netperf c) MPI - Pallas d) SDP - iperf e) SRP - iometer f) iSER - iometer 10) Documentation requirements a) Product brief b) Installation guide c) User guide d) Release notes e) Troubleshooting f) Test Plan and Test Report 11) Storage target test requirements a) Engenio target - Mellanox will be responsible of verification b) Cisco & SST - please add more target systems 12) Firmware and Hardware versions to be tested a) Both DDR and SDR modes should be supported. 
b) The FW burned should be the latest official release from Mellanox:
i) InfiniHost III Lx: fw-25204-1.0.800
ii) InfiniHost III Ex: fw-25218-5.1.400 and fw-25208-4.7.600 (both will be released in 2 weeks)
iii) InfiniHost: fw-23108-3.4.000
iv) InfiniScale III: fw-47396-0.8.3
v) InfiniScale: fw-43132-5.5.0

13) Specifications compliance:
a) Verbs & management: InfiniBand Architecture Specification, Volume 1, Release 1.2
b) IPoIB: www.ietf.org: draft-ietf-ipoib-architecture-04 and draft-ietf-ipoib-ip-over-infiniband-07
c) SDP: Annex A4 of the InfiniBand Architecture Specification, Volume 1, Release 1.2
d) SRP: SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D (www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf)
e) MPI: www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
f) iSER: www.ietf.org/internet-drafts/draft-hufferd-iser-ib-01.pdf
g) RDS: SS can you provide info

The following two items are very important for the SW stack QA but not gating for starting the release process.
1) ISV test requirements - coverage for all ULPs
2) Database test requirements
Cisco, SS and Voltaire should define those, since they already have test beds for commercial applications and databases."

From iod00d at hp.com Mon Feb 27 23:12:40 2006 From: iod00d at hp.com (Grant Grundler) Date: Mon, 27 Feb 2006 23:12:40 -0800 Subject: [openib-general] Performance optimization In-Reply-To: <4403B542.8060401@uci.edu> References: <43FA29A1.5020301@uci.edu> <20060227174237.GH31654@esmail.cup.hp.com> <4403B542.8060401@uci.edu> Message-ID: <20060228071240.GH880@esmail.cup.hp.com> On Mon, Feb 27, 2006 at 06:28:18PM -0800, Frithjof Kruggel wrote: > > Hi, > > thank you for your advice. However, using > > taskset 0x1 ibv_uc_pingpong xxxxx Is this test a bandwidth test or a latency test? Anything that says "pingpong" suggests latency to me. > did not report rates beyond 1850 MBit/s.
> If I raise the size (--size 65536), rates are about > 3500 MBit/s. There is no significant difference if > I try this within the same node or across node, > with or without a switch. Well, that suggests the system is CPU bound or bottlenecking on control traffic somehow. Fairly normal for a latency test. What is the PCI-X bus frequency? (Check HW Owners Guide or system data sheet) 3500 Mbit/s would be about right for a 66Mhz/64-bit bus. Is MSI or MSI-X enabled? Can you remind me which model of box/motherboard is being used? (ISTR Opteron and the issues older Operton can't support MSI/MSI-X. That will cause substantial performance penalty on smaller transactions that involve lots of interrupts). > I would like to know whether there is any step-by-step > list (even checking BIOS parameters?) in order to > understand where something might go wrong. I'm not aware of one since so many different things impact performance. Everything from DIMM types, PCI bus frequencies, CPU cache size, chipset features (e.g. write coalescing or MSI support), to how workloads have different cache "footprint". sorry that it's not so simple. grant From mst at mellanox.co.il Mon Feb 27 23:32:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 09:32:09 +0200 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <1141077981.30345.3.camel@serpentine.pathscale.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> Message-ID: <20060228073209.GC17543@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: Re: RFC: SDP plans > > On Mon, 2006-02-27 at 21:33 +0200, Michael S. 
Tsirkin wrote: > > > So its possible to make the protocol family configurable in userspace sdp > > library (libsdp) and then include libsdp in Release 1.0 and it will work with > > whatever SDP code makes it upstream. > > The one thing that worries me about this is that it appears that libsdp > silently falls back to using TCP at the moment, at least based on some > internal tests we did with the environment slightly and inadvertantly > misconfigured. > > If this is the case, then having the protocol family potentially change > out under SDP could make it occur that a kernel update (or a switch to a > kernel that doesn't have SDP available) would cause apps to silently > start using IPoIB instead of SDP. Which is the sort of behaviour that > leads to rapid hair loss on the part of both users and support people. > > If libsdp were to abort by default if SDP wasn't available, then I think > we'd have a fairly strong case that we're not creating a nightmare for > ourselves. OK, makes sense. But it should already work this way: we now have in libsdp.conf match_both listen *:12865 # match listen *:21 match_both listen *:22 match program * And match program * should cause everything to go over SDP by default. So you either have a modified libsdp.conf (could you post it please?) or there's a bug in libsdp. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From eli at mellanox.co.il Tue Feb 28 00:08:35 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 28 Feb 2006 10:08:35 +0200 Subject: [openib-general] [PATCH] mthca - optimize sinai large message Message-ID: <1141114115.7277.1.camel@mtls03.yok.mtl.com> Memory key generation modified to optimize large messages transfer in Sinai. This implementation restricts the MPT table size for Sinai to a maximum of 2^23 entries. 
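
To make the 2^23 limit concrete, here is a minimal standalone sketch of the bit arithmetic used by the adjust_key() helper and the log_mpt_sz check in the patch below. The names adjust_key_sketch and mpt_size_ok are hypothetical, and the driver's flag test on struct mthca_dev is replaced by a plain int parameter; this only illustrates the arithmetic and makes no claim about how the hardware interprets the resulting bits.

```c
#include <stdint.h>

/* Hypothetical standalone copy of the arithmetic in the patch's adjust_key().
 * sinai_opt stands in for (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT). */
static uint32_t adjust_key_sketch(int sinai_opt, uint32_t key)
{
	if (sinai_opt)
		/* keep bits 0..22 of the index and copy bit 3 into bit 23 */
		return ((key << 20) & 0x800000u) | (key & 0x7fffffu);
	return key;
}

/* Mirrors the new check in mthca_make_profile(): with the optimization
 * enabled, a log_mpt_sz above 23 (more than 2^23 entries) is rejected. */
static int mpt_size_ok(int sinai_opt, unsigned int log_mpt_sz)
{
	return !(sinai_opt && log_mpt_sz > 23);
}
```

For example, index 0x8 (bit 3 set) becomes 0x800008, while index 0x7 is returned unchanged. Since only the low 23 bits of the index survive, an MPT table with more than 2^23 entries could not be addressed this way, which appears to be why the patch refuses such a profile.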
Signed-off-by: Eli Cohen Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_profile.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_profile.c @@ -153,8 +153,8 @@ u64 mthca_make_profile(struct mthca_dev "won't in 0x%llx bytes of context memory.\n", (unsigned long long) total_size, (unsigned long long) mem_avail); - kfree(profile); - return -ENOMEM; + total_size = -ENOMEM; + goto exit; } if (profile[i].size) @@ -260,6 +260,13 @@ u64 mthca_make_profile(struct mthca_dev */ dev->limits.num_pds = MTHCA_NUM_PDS; + /* for Sinai MPT table must be smaller the 2^24 for optimized oprtatipn */ + if ((dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) && init_hca->log_mpt_sz > 23) { + total_size = -ENOSYS; + mthca_err(dev, "MPT table too large\n"); + goto exit; + } + /* * For Tavor, FMRs use ioremapped PCI memory. For 32 bit * systems it may use too much vmalloc space to map all MTT @@ -272,6 +279,7 @@ u64 mthca_make_profile(struct mthca_dev else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; +exit: kfree(profile); return total_size; } Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_main.c @@ -937,11 +937,12 @@ static struct { u64 latest_fw; int is_memfree; int is_pcie; + int mkey_opt; } mthca_hca_table[] = { - [TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, .is_pcie = 0 }, - [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, .is_pcie = 1 }, - [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, .is_pcie = 1 }, - [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1 } + [TAVOR] = { .latest_fw = MTHCA_FW_VER(3, 3, 3), .is_memfree = 0, .is_pcie = 0, 
.mkey_opt = 0 }, + [ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 7, 0), .is_memfree = 0, .is_pcie = 1, .mkey_opt = 0 }, + [ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 1, 0), .is_memfree = 1, .is_pcie = 1, .mkey_opt = 0 }, + [SINAI] = { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1, .mkey_opt = 1 } }; static int __devinit mthca_init_one(struct pci_dev *pdev, @@ -1037,6 +1038,9 @@ static int __devinit mthca_init_one(stru mdev->mthca_flags |= MTHCA_FLAG_MEMFREE; if (mthca_hca_table[id->driver_data].is_pcie) mdev->mthca_flags |= MTHCA_FLAG_PCIE; + if (mthca_hca_table[id->driver_data].mkey_opt) + mdev->mthca_flags |= MTHCA_FLAG_SINAI_OPT; + /* * Now reset the HCA before we touch the PCI capabilities or Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_dev.h @@ -71,7 +71,8 @@ enum { MTHCA_FLAG_NO_LAM = 1 << 5, MTHCA_FLAG_FMR = 1 << 6, MTHCA_FLAG_MEMFREE = 1 << 7, - MTHCA_FLAG_PCIE = 1 << 8 + MTHCA_FLAG_PCIE = 1 << 8, + MTHCA_FLAG_SINAI_OPT = 1 << 9 }; enum { Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_mr.c @@ -76,6 +76,8 @@ struct mthca_mpt_entry { #define MTHCA_MPT_STATUS_SW 0xF0 #define MTHCA_MPT_STATUS_HW 0x00 +#define SINAI_FMR_KEY_INC 0x1000000 + /* * Buddy allocator for MTT segments (currently not very efficient * since it doesn't keep a free list and just searches linearly @@ -330,6 +332,14 @@ static inline u32 key_to_hw_index(struct return tavor_key_to_hw_index(key); } +static inline u32 adjust_key(struct mthca_dev *dev, u32 key) +{ + if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + return ((key << 20) & 0x800000) | (key & 0x7fffff); + else + return 
key; +} + int mthca_mr_alloc(struct mthca_dev *dev, u32 pd, int buffer_size_shift, u64 iova, u64 total_size, u32 access, struct mthca_mr *mr) { @@ -345,6 +355,7 @@ int mthca_mr_alloc(struct mthca_dev *dev key = mthca_alloc(&dev->mr_table.mpt_alloc); if (key == -1) return -ENOMEM; + key = adjust_key(dev, key); mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); if (mthca_is_memfree(dev)) { @@ -504,6 +515,7 @@ int mthca_fmr_alloc(struct mthca_dev *de key = mthca_alloc(&dev->mr_table.mpt_alloc); if (key == -1) return -ENOMEM; + key = adjust_key(dev, key); idx = key & (dev->limits.num_mpts - 1); mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key); @@ -687,7 +699,10 @@ int mthca_arbel_map_phys_fmr(struct ib_f ++fmr->maps; key = arbel_key_to_hw_index(fmr->ibmr.lkey); - key += dev->limits.num_mpts; + if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + key += SINAI_FMR_KEY_INC; + else + key += dev->limits.num_mpts; fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key); *(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_SW; Index: linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- linux-2.6.14.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c +++ linux-2.6.14.2/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1173,7 +1173,8 @@ int mthca_INIT_HCA(struct mthca_dev *dev int err; #define INIT_HCA_IN_SIZE 0x200 -#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_FLAGS1_OFFSET 0x00c +#define INIT_HCA_FLAGS2_OFFSET 0x014 #define INIT_HCA_QPC_OFFSET 0x020 #define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) #define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) @@ -1216,15 +1217,18 @@ int mthca_INIT_HCA(struct mthca_dev *dev memset(inbox, 0, INIT_HCA_IN_SIZE); + if (dev->mthca_flags & MTHCA_FLAG_SINAI_OPT) + MTHCA_PUT(inbox, 0x1, INIT_HCA_FLAGS1_OFFSET); + #if defined(__LITTLE_ENDIAN) - *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) 
&= ~cpu_to_be32(1 << 1); #elif defined(__BIG_ENDIAN) - *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(1 << 1); #else #error Host endianness not defined #endif /* Check port for UD address vector: */ - *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + *(inbox + INIT_HCA_FLAGS2_OFFSET / 4) |= cpu_to_be32(1); /* We leave wqe_quota, responder_exu, etc as 0 (default) */ From ianjiang.ict at gmail.com Tue Feb 28 00:17:19 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 28 Feb 2006 16:17:19 +0800 Subject: [openib-general] [CM]problem while connecting Message-ID: <7b2fa1820602280017j6e98e942mad34b4dba125de06@mail.gmail.com> Hi Roland, Thanks for your reply! The previous problem disappeared after I modified my code as you told. But another one while connecting hold me up: Feb 28 23:05:02 linux3 kernel: VIPKL(1): var/tmp/IBGD/tmp/openib/infiniband/ib_verbs/hw/mellanox-hca/vip/qpm.c[1108]: VAPI_RESET=>VAPI_RESET, must_mask: 0x0, allowed_mask: 0x1 Feb 28 23:05:02 linux3 kernel: VIPKL(1): var/tmp/IBGD/tmp/openib/infiniband/ib_verbs/hw/mellanox-hca/vip/qpm.c[1111]: The following bits are not supported or allowed in this transition: 0x30 Feb 28 23:05:02 linux3 kernel: Feb 28 23:05:02 linux3 kernel: [KERNEL_IB][tsIbTavorQpModify][/var/tmp/IBGD/tmp/openib/infiniband/ib_verbs/hw/provider/tavor_qp.c:457]InfiniHost_III_Lx0: VAPI_modify_qp failed, return code = -233 (Unsupported attribute) Feb 28 23:05:02 linux3 kernel: [IB_CM][ib_cm_connect][/var/tmp/IBGD/tmp/openib/infiniband/ib_cm/cm_api.c:168]ib_qp_modify to INIT failed Was this caused by my improper QP creation? Any suggestion is appreciated. 
Here are the relative codes of my own: create QP ======= qp_param.limit.max_outstanding_send_request = params_p->ib_res.cm_send_cq_size; qp_param.limit.max_outstanding_receive_request = params_p->ib_res.cm_recv_cq_size; qp_param.limit.max_send_gather_element = 1; qp_param.limit.max_receive_scatter_element = 1; qp_param.pd = params_p->ib_res.cm_pd_p; qp_param.send_queue = params_p->ib_res.cm_send_cq_p; qp_param.receive_queue = params_p->ib_res.cm_recv_cq_p; qp_param.send_policy = IB_WQ_SIGNAL_SELECTABLE; qp_param.receive_policy = IB_WQ_SIGNAL_ALL; qp_param.transport = IB_TRANSPORT_RC; qp_param.device_specific = NULL; res = ib_qp_create(&qp_param, ¶ms_p->ib_res.cm_qp_p, ¶ms_p->ib_res.cm_qpn); if (res) { PRINT_ERR("ib_qp_create failed\n"); cm_ib_clean(¶ms_p->ib_res); return -1; } PRINT_TRACE("QP created: 0x%p, Num: 0x%x\n", params_p->ib_res.cm_qp_p, params_p->ib_res.cm_qpn); connect ====== active_param.qp = params_p->ib_res.cm_qp_p; active_param.req_private_data = msg_p; active_param.req_private_data_len = msg_len; active_param.responder_resources = 4; active_param.initiator_depth = 4; active_param.retry_count = 7; active_param.rnr_retry_count = 7; active_param.cm_response_timeout = 20; /* 4 seconds */ active_param.max_cm_retries = 15; active_param.flow_control = 1; params_p->ib_res.cm_path.mtu = (params_p->ib_res.cm_hca_infor.device_id == 23108) ? 
MTU1024 : MTU2048; memcpy(¶ms_p->ib_res.cm_path.sgid, params_p->ib_res.cm_port_gid, sizeof(tTS_IB_GID)); memcpy(¶ms_p->ib_res.cm_path.dgid, params_p->ib_res.cm_port_gid, sizeof(tTS_IB_GID)); params_p->ib_res.cm_path.packet_life = 13; /* FIXME */ PRINT_TRACE("connect...\n"); res = ib_cm_connect(&active_param, ¶ms_p->ib_res.cm_path, NULL, params_p->ib_res.dst_service_id, 0, params_p->ib_res.call_back_func, (void *)params_p, ¶ms_p->ib_res.cm_comm_id_conn); if (res) { PRINT_ERR("ib_cm_connect failed\n"); return -1; } -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Tue Feb 28 00:38:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 10:38:12 +0200 Subject: [openib-general] Re: [PATCH] mthca - optimize sinai large message In-Reply-To: <1141114115.7277.1.camel@mtls03.yok.mtl.com> References: <1141114115.7277.1.camel@mtls03.yok.mtl.com> Message-ID: <20060228083812.GC17783@mellanox.co.il> Quoting Eli Cohen : > Memory key generation modified to optimize large messages transfer in Sinai. > This implementation restricts the MPT table size for Sinai to a maximum of > 2^23 entries. > > Signed-off-by: Eli Cohen This also restricts the number of maps for an FMR on Sinai to 256. This is not an issue since fmr pool in core currently only uses up to 64 maps per FMR. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From jackm at mellanox.co.il Tue Feb 28 01:12:41 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 28 Feb 2006 11:12:41 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060227193637.GB20064@mellanox.co.il> References: <44034F36.2000402@ichips.intel.com> <20060227193637.GB20064@mellanox.co.il> Message-ID: <200602281112.41981.jackm@mellanox.co.il> On Monday 27 February 2006 21:36, Michael S. Tsirkin wrote: > Quoting Sean Hefty : > > As long as all of the ACKs match to the same RMPP response transaction, > > this should be okay. Some of the ACKs will be interpreted as > > old/duplicates and be discarded. The first response should be > > reassembled on the requester side. Additional responses may time out > > waiting for an ACK that gets matched to another request, but that > > shouldn't matter. > > > > If this is not the case, I'd like to understand why this isn't happening. > > There may be a more serious issue that we're overlooking. > If one of the duplicate sessions on the responder side gets the ACK, it will issue an abort, because the segment being ACKed has not yet been sent by that duplicate session. In this case, the requester will abort the receiving session, with an S2B status code (IB Spec 13.6.2.2, page 774) -- even though the transfer might have been properly completed if the primary responder session had received the ACK instead of one of the duplicate sessions. Which session actually receives the ACK is a toss-up, since when a session sends a segment, it is placed (upon send completion) at the tail of the wait queue. Thus, arriving ACKs will likely be routed (based upon TID only) to one of the duplicate sessions.
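
The failure mode described above can be sketched with a toy model. This is not the ib_mad code: struct rmpp_session, find_by_tid() and deliver_ack() are hypothetical names, and the only behaviour modelled is the one stated above, namely that an ACK is matched on TID alone and a session aborts when it is ACKed for a segment it has not yet sent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a responder-side RMPP send session. */
struct rmpp_session {
	uint64_t tid;           /* transaction ID from the MAD header */
	uint32_t last_seg_sent; /* highest segment number sent so far */
};

enum ack_result { ACK_OK, ACK_ABORT, ACK_NO_SESSION };

/* TID-only lookup: returns the first session in queue order with this TID. */
static struct rmpp_session *find_by_tid(struct rmpp_session *q, size_t n,
					uint64_t tid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (q[i].tid == tid)
			return &q[i];
	return NULL;
}

/* Hand an ACK for segment seg to whichever session the lookup returns. */
static enum ack_result deliver_ack(struct rmpp_session *q, size_t n,
				   uint64_t tid, uint32_t seg)
{
	struct rmpp_session *s = find_by_tid(q, n, tid);

	if (!s)
		return ACK_NO_SESSION;
	/* An ACK for a segment this session never sent triggers an abort. */
	return seg > s->last_seg_sent ? ACK_ABORT : ACK_OK;
}
```

With two sessions sharing TID 3, where the primary has sent 5 segments and the duplicate none, an ACK for segment 5 yields ACK_OK only if the primary happens to be first in the queue; if the duplicate is found first, the same ACK yields ACK_ABORT.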
-- Jack From ogerlitz at voltaire.com Tue Feb 28 02:19:36 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 12:19:36 +0200 Subject: [openib-general] [PATCH 3/6] [RFC] iser initiator In-Reply-To: <4401A6AA.3080508@voltaire.com> References: <20060222162507.GB24303@lst.de> <43FD8353.3020909@voltaire.com> <43FD8DF9.5090200@voltaire.com> <20060224211615.GA30927@lst.de> <4401A6AA.3080508@voltaire.com> Message-ID: <440423B8.2080302@voltaire.com> Or Gerlitz wrote: > So the iser code will be cleaned in two phases with respect "single sc": > +1 assume everywhere the sc has SG and just SG-fy the singles. > +2 remove the SG-fy trick and maintain it as a patch for backports OK, I have removed the explicit support of non SG SCSI commands. When 2.6.16 is released i will also patch out the code that SG-ifies such commands and leave only a BUG assertion verifying sc->use_sg is non zero, as done in iscsi_tcp.c ------------------------------------------------------------------------ r5523 | ogerlitz | 2006-02-28 12:08:38 +0200 (Tue, 28 Feb 2006) | 6 removed explicit handling of non SG scsi commands. Such commands and those copied as rdma unalignment are SG-ified using sg_init_one() and treated later as one entry SG elsewhere in the code. Signed-off-by: Or Gerlitz From ogerlitz at voltaire.com Tue Feb 28 03:17:09 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 13:17:09 +0200 (IST) Subject: [openib-general] [PATCH] [RFC] use device max_map_per_fmr in fmr pool Message-ID: Roland, Remapping an FMR the maximum number of times the device allows should reduce the amortized cost of using fmrs to the minimum possible. I was thinking on a patch of this spirit (which is not complete i know since down the code there's a usage of IB_FMR_MAX_REMAPS) but i figured out that mthca does not implement filling device_attr.max_map_per_fmr. >From what i understand, struct mthca_limits should be enhanced to support it, something which i failed to do... 
i think its related to 2^ log2(some function of the MPT size) but i am not sure. I understand that this relates also to the thread which does an optimizations for sinai... anyway it would be good if for 2.6.17 the max remaps attribute would be supported so we can consider enhance the fmr pool to use it. Or. Index: fmr_pool.c =================================================================== --- fmr_pool.c (revision 5524) +++ fmr_pool.c (working copy) @@ -214,6 +214,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(s { struct ib_device *device; struct ib_fmr_pool *pool; + struct ib_device_attr device_attr; int i; int ret; @@ -228,6 +229,12 @@ struct ib_fmr_pool *ib_create_fmr_pool(s return ERR_PTR(-ENOSYS); } + ret = ib_query_device(device, &device_attr); + if (ret) { + printk(KERN_WARNING "couldn't query device"); + return ERR_PTR(ret); + } + pool = kmalloc(sizeof *pool, GFP_KERNEL); if (!pool) { printk(KERN_WARNING "couldn't allocate pool struct"); @@ -279,7 +286,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(s struct ib_pool_fmr *fmr; struct ib_fmr_attr attr = { .max_pages = params->max_pages_per_fmr, - .max_maps = IB_FMR_MAX_REMAPS, + .max_maps = device_attr.max_map_per_fmr, .page_shift = params->page_shift }; From ogerlitz at voltaire.com Tue Feb 28 03:15:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 13:15:08 +0200 Subject: [openib-general] [PATCH] mthca - optimize sinai large message In-Reply-To: <1141114115.7277.1.camel@mtls03.yok.mtl.com> References: <1141114115.7277.1.camel@mtls03.yok.mtl.com> Message-ID: <440430BC.402@voltaire.com> Eli Cohen wrote: > Memory key generation modified to optimize large messages transfer > in Sinai. > This implementation restricts the MPT table size for Sinai to a > maximum of 2^23 entries. Can you elaborate a little what are the implications of not using this optimizations and what IB transport/size/type of messages (or better which ULPs) are effected? Or. 
From eli at mellanox.co.il Tue Feb 28 04:23:19 2006 From: eli at mellanox.co.il (Eli Cohen) Date: Tue, 28 Feb 2006 14:23:19 +0200 Subject: [openib-general] [PATCH] mthca - optimize sinai large message In-Reply-To: <440430BC.402@voltaire.com> References: <1141114115.7277.1.camel@mtls03.yok.mtl.com> <440430BC.402@voltaire.com> Message-ID: <1141129399.25707.7.camel@mtls03.yok.mtl.com> On Tue, 2006-02-28 at 13:15 +0200, Or Gerlitz wrote: > Can you elaborate a little what are the implications of not using this > optimizations and what IB transport/size/type of messages (or better > which ULPs) are effected? The improvement will be noticeable for protocol messages larger than ~80 KByte: RDMA Write/Read and Send. From mst at mellanox.co.il Tue Feb 28 04:25:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 14:25:27 +0200 Subject: [openib-general] Re: [PATCH] [RFC] use device max_map_per_fmr in fmr pool In-Reply-To: References: Message-ID: <20060228122527.GD19855@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: [PATCH] [RFC] use device max_map_per_fmr in fmr pool > > Roland, > > Remapping an FMR the maximum number of times the device allows should > reduce the amortized cost of using fmrs to the minimum possible. > > > I was thinking on a patch of this spirit (which is not complete i know > since down the code there's a usage of IB_FMR_MAX_REMAPS) but i figured > out that mthca does not implement filling device_attr.max_map_per_fmr. > From what i understand, struct mthca_limits should be enhanced to support > it, something which i failed to do... i think its related to > 2^ log2(some function of the MPT size) but i am not sure. > > I understand that this relates also to the thread which does an > optimizations for sinai... anyway it would be good if for 2.6.17 > the max remaps attribute would be supported so we can consider enhance > the fmr pool to use it. > > Or. Does this actually help performance?
How about increasing IB_FMR_MAX_REMAPS from 32 to say 128 and checking? > Index: fmr_pool.c > =================================================================== > --- fmr_pool.c (revision 5524) > +++ fmr_pool.c (working copy) > @@ -214,6 +214,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(s > { > struct ib_device *device; > struct ib_fmr_pool *pool; > + struct ib_device_attr device_attr; > int i; > int ret; > > @@ -228,6 +229,12 @@ struct ib_fmr_pool *ib_create_fmr_pool(s > return ERR_PTR(-ENOSYS); > } > > + ret = ib_query_device(device, &device_attr); > + if (ret) { > + printk(KERN_WARNING "couldn't query device"); > + return ERR_PTR(ret); > + } > + > pool = kmalloc(sizeof *pool, GFP_KERNEL); > if (!pool) { > printk(KERN_WARNING "couldn't allocate pool struct"); > @@ -279,7 +286,7 @@ struct ib_fmr_pool *ib_create_fmr_pool(s > struct ib_pool_fmr *fmr; > struct ib_fmr_attr attr = { > .max_pages = params->max_pages_per_fmr, > - .max_maps = IB_FMR_MAX_REMAPS, > + .max_maps = device_attr.max_map_per_fmr, > .page_shift = params->page_shift > }; Allocating resources up to a maximum supported by a device is rarely optimal. Its easy to imagine a device where FMR number of maps is a shared resource, so allocating one FMR with max_map_per_fmr will not let you create any more of these. Something like .max_maps = min(IB_FMR_MAX_REMAPS, device_attr.max_map_per_fmr) Would make more sense in my opinion. We could also change ib_alloc_fmr implementation in mthca to report the actual max_maps supported by this fmr (e.g. device could round this up to the power of 2 or something). -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 28 04:19:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Feb 2006 07:19:35 -0500 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> Message-ID: <1141128535.4335.4148.camel@hal.voltaire.com> Hi Devesh, On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: > Hi list, > Please anybody brife me about the LID assignment policy used by opensm > subnet manager. Can user specify fixed LID mappings using a file? There is a file it creates with these in it so they can be reused subsequently. It is /var/cache/osm/guid2lid. opensm -h has the following option: -c --cache-options Cache the given command line options into the file /var/cache/osm/opensm.opts for use next invocation The cache directory can be changed by the environment variable OSM_CACHE_DIR Is that suitable for your needs ? -- Hal > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ogerlitz at voltaire.com Tue Feb 28 04:44:36 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 14:44:36 +0200 Subject: [openib-general] Re: [PATCH] [RFC] use device max_map_per_fmr in fmr pool In-Reply-To: <20060228122527.GD19855@mellanox.co.il> References: <20060228122527.GD19855@mellanox.co.il> Message-ID: <440445B4.5070702@voltaire.com> One issue i brought here is that the max_map_per_fmr field is not implemented by mthca query device, this better be fixed, also the field name should be changed to max_remaps_per_fmr which is much clearer. Michael S. Tsirkin wrote: > Does this actually help performance? 
> How about increasing IB_FMR_MAX_REMAPS from 32 to say 128 and checking? I am not on working on performance optimizations now, the code we wrote to the GEN1 driver was using the max possible remaps and i'd like to work here this way as well. > Allocating resources up to a maximum supported by a device is rarely optimal. > Its easy to imagine a device where FMR number of maps is a shared resource, > so allocating one FMR with max_map_per_fmr will not let you create any more > of these. Why imagine something related to Mellanox proprietary implementation over which the number of times you remap an FMR does not consume --any-- further resources on the device other (at least for tavor and arbel, correct if i am wrong to other devices). From hch at lst.de Tue Feb 28 04:47:32 2006 From: hch at lst.de (Christoph Hellwig) Date: Tue, 28 Feb 2006 13:47:32 +0100 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <4402FE05.6020605@voltaire.com> References: <20060222162903.GC24303@lst.de> <4402FE05.6020605@voltaire.com> Message-ID: <20060228124732.GA22013@lst.de> > use kmap_atomic instead of page_address in the code copying from/to SG > which is unaligned for rdma this isn't entirely correct I think. iser_finalize_rdma_unaligned_sg is called from a tasklist, which is softirq context, so you can't use KM_USER0 there. KM_SOFTIRQ0 should probably work. Otoh tasklets are not very scalable because tasklets of a type a serialized against running at multiple cpus, so maybe you should switch to a different mechanisms. From mst at mellanox.co.il Tue Feb 28 04:55:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 14:55:52 +0200 Subject: [openib-general] Re: [PATCH] [RFC] use device max_map_per_fmr in fmr pool In-Reply-To: <440445B4.5070702@voltaire.com> References: <20060228122527.GD19855@mellanox.co.il> <440445B4.5070702@voltaire.com> Message-ID: <20060228125552.GH19855@mellanox.co.il> Quoting r. 
Or Gerlitz : > I am not on working on performance optimizations now, the code we wrote > to the GEN1 driver was using the max possible remaps and i'd like to > work here this way as well. But why make changes if this doesnt gain performance? > >Allocating resources up to a maximum supported by a device is rarely > >optimal. > >Its easy to imagine a device where FMR number of maps is a shared resource, > >so allocating one FMR with max_map_per_fmr will not let you create any more > >of these. > > Why imagine something related to Mellanox proprietary implementation > over which the number of times you remap an FMR does not consume --any-- > further resources on the device other (at least for tavor and arbel, > correct if i am wrong to other devices). Fair enough. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Tue Feb 28 05:05:16 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 15:05:16 +0200 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <20060228124732.GA22013@lst.de> References: <20060222162903.GC24303@lst.de> <4402FE05.6020605@voltaire.com> <20060228124732.GA22013@lst.de> Message-ID: <44044A8C.9060507@voltaire.com> Christoph Hellwig wrote: > > use kmap_atomic instead of page_address in the code copying from/to SG > > which is unaligned for rdma > > this isn't entirely correct I think. iser_finalize_rdma_unaligned_sg > is called from a tasklist, which is softirq context, so you can't use > KM_USER0 there. KM_SOFTIRQ0 should probably work. This is exactly the case, iser_finalize_rdma_unaligned_sg runs in tasklet context and its code is using KM_SOFTIRQ0 and iser_start_rdma_unaligned_sg runs in kernel thread or user process context and its code uses KM_USER0 > Otoh tasklets are not very scalable because tasklets of a type a > serialized against running at multiple cpus, so maybe you should > switch to a different mechanisms. I see. 
Well, first, the current iser code is not instrumented for completion reaping from the CQ to run from multiple contexts. Second, the code in iscsi_iser.c borrowed from drivers/scsi/iscsi_tcp.c assumes it runs in tasklet context (ie spin_lock_bh() and friends) as the tcp upcalls are. So for now we plan to keep using tasklets and it does not seem an issue for the upstream inclusion. Later down the road (specifically for open source iser target...) i guess an implementation of multiple (num cpus) kernel threads competing on polling the CQ would be considered. Does this makes sense? Or. From mst at mellanox.co.il Tue Feb 28 05:08:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 15:08:19 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <4403575F.1010305@ichips.intel.com> References: <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> Message-ID: <20060228130819.GJ19855@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID > > Michael S. Tsirkin wrote: > >>>Host A sends an RMPP request message to host B with TID=3 > >>>Host B sends an RMPP request message to host A with TID=3. > >>>Now if A generates an RMPP response it has TID=3. > >>> > >>>If B sends ACK, host A has no idea which transaction is being ACKed. > >> > >>Bah... can we distinguish which transaction is being ACKed by the > >>response bit? > > > >Are you talking about checking IB_MGMT_METHOD_RESP? > > > >How is this different from what I proposed? > > Yes - this is what you proposed. 
I believe that it can work for ACKs since > an ACK must match with a send. > > >Wont this work for Abort/Stop as well? > > Given the example above, with hosts A and B sending requests, if host B > sends an abort, it's still unknown which transaction is being aborted, > since neither the send or receive would have the response bit set. Sorry for being dense, I'm not sure I understand. Do you mean response when you say receive? We never have both a request and a response outstanding to the same remote GID with the same TID, do we? If I get an abort with a response method - this is an abort that a receiver issued, so it's an abort of a transaction that I am currently sending. If I get an abort with a request method - this is an abort that a sender issued, so it's an abort of a transaction that I am currently receiving. So how are aborts different from ACKs here? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 05:11:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 15:11:06 +0200 Subject: [openib-general] Re: [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <20060228124732.GA22013@lst.de> References: <20060222162903.GC24303@lst.de> <4402FE05.6020605@voltaire.com> <20060228124732.GA22013@lst.de> Message-ID: <20060228131106.GL19855@mellanox.co.il> Quoting r. Christoph Hellwig : > Subject: Re: [PATCH 5/6] [RFC] iser handling of memory for RDMA > > > use kmap_atomic instead of page_address in the code copying from/to SG > > which is unaligned for rdma > > this isn't entirely correct I think. iser_finalize_rdma_unaligned_sg > is called from a tasklist, which is softirq context, so you can't use > KM_USER0 there. KM_SOFTIRQ0 should probably work. Otoh tasklets are not > very scalable because tasklets of a type a serialized against running > at multiple cpus, so maybe you should switch to a different mechanisms.
You could just switch to using a workqueue - it's already per cpu and KM_USER0 is legal there. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ogerlitz at voltaire.com Tue Feb 28 05:17:38 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 28 Feb 2006 15:17:38 +0200 Subject: [openib-general] [PATCH 5/6] [RFC] iser handling of memory for RDMA In-Reply-To: <44044A8C.9060507@voltaire.com> References: <20060222162903.GC24303@lst.de> <4402FE05.6020605@voltaire.com> <20060228124732.GA22013@lst.de> <44044A8C.9060507@voltaire.com> Message-ID: <44044D72.3000705@voltaire.com> Or Gerlitz wrote: > Second, the code in > iscsi_iser.c borrowed from drivers/scsi/iscsi_tcp.c assumes it runs in > tasklet context (ie spin_lock_bh() and friends) as the tcp upcalls are. I realize now that the network stack (and hence iscsi_tcp) upflow runs in softirq context, which is more scalable than tasklets, but I still think we can live with the iser upflow running in a tasklet. Or. From mst at mellanox.co.il Tue Feb 28 05:46:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 15:46:10 +0200 Subject: [openib-general] [PATCH] make cmatose compile Message-ID: <20060228134610.GP19855@mellanox.co.il> Make cmatose compile under 2.6.15 Signed-off-by: Michael S.
Tsirkin Index: drivers/infiniband/util/cmatose/cmatose.c =================================================================== --- drivers/infiniband/util/cmatose/cmatose.c (revision 5525) +++ drivers/infiniband/util/cmatose/cmatose.c (working copy) @@ -137,7 +137,7 @@ node->mem = mem; node->addr = dma_map_single(node->cma_id->device->dma_device, node->mem, message_size, DMA_TO_DEVICE); - pci_unmap_addr_set(&node, mapping, node->addr); + pci_unmap_addr_set(node, mapping, node->addr); return 0; out: kfree(mem); @@ -385,7 +385,7 @@ if (node->mem) { dma_unmap_single(node->cma_id->device->dma_device, - pci_unmap_addr(&node, mapping), + pci_unmap_addr(node, mapping), message_size, DMA_TO_DEVICE); ib_dereg_mr(node->mr); kfree(node->mem); -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From halr at voltaire.com Tue Feb 28 05:39:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Feb 2006 08:39:29 -0500 Subject: [openib-general] Re: [PATCH] OpenSM - osm_vendor_get_all_port_attr - add info In-Reply-To: <5zlkvxjeh2.fsf@mtl066.yok.mtl.com> References: <5zlkvxjeh2.fsf@mtl066.yok.mtl.com> Message-ID: <1141133968.4335.4809.camel@hal.voltaire.com> Hi Yael, On Mon, 2006-02-27 at 05:14, Yael Kalka wrote: > Hi Hal, > > Currently osm_vendor_get_all_port_attr doesn't update the port number > information. Yes, this is currently not implemented and needed. What uses this ? > The following patch adds this information. Aside from being rejected and having to be applied manually, I have some comments on this. 
> Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 5496) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -565,8 +565,10 @@ osm_vendor_get_all_port_attr( > ib_net64_t *p_guid = portguids, *e = portguids + *p_num_ports; > umad_ca_t ca; > int lids[*p_num_ports]; > + int portnums[*p_num_ports]; > int linkstates[*p_num_ports]; > int *p_lid = lids; > + int *p_portnum = portnums; This is not used in your patch (but should be in the below loop in osm_vendor_get_all_port_attr): for (i = 0; p_guid < e && i < p_vend->ca_count; i++) { ... if ((r = umad_get_ca(p_vend->ca_names[i], &ca)) == 0) { for (j = 0; j <= ca.numports; j++) { if (ca.ports[j]) { *p_lid = ca.ports[j]->base_lid; *p_linkstates = ca.ports[j]->state; *p_portnum = ca.ports[j]->portnum; <=========== } p_lid++; p_linkstates++; p_portnum++; <================================= } } } > int *p_linkstates = linkstates; > umad_port_t def_port = {""}; > int r, i, j; > @@ -622,6 +624,7 @@ osm_vendor_get_all_port_attr( > > portguids[0] = def_port.port_guid; > lids[0] = def_port.base_lid; > + portnums[0] = def_port.portnum; > linkstates[0] = def_port.state; > sm_lid = def_port.sm_lid; > > @@ -642,6 +645,7 @@ osm_vendor_get_all_port_attr( > continue; > p_attr_array[j].port_guid = portguids[i]; > p_attr_array[j].lid = lids[i]; > + p_attr_array[j].port_num = portnums[i]; > if (j == 0) > p_attr_array[j].sm_lid = sm_lid; > else > -- Hal From Bret.Weber at engenio.com Tue Feb 28 06:19:32 2006 From: Bret.Weber at engenio.com (Weber, Bret) Date: Tue, 28 Feb 2006 07:19:32 -0700 Subject: [openib-general] RE: [Openib-promoters] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws Message-ID: Bill, This is the first that I have heard of the desire for some to drop SRP and iSER. 
These are our storage protocols and the defined standard methods to run SCSI over IB to native IB storage (or any other RDMA based wire). Active testing is going on with feedback being given back to the developers on these drivers. Storage is a major part of the promise of IB, RDMA, and Unified wire. We can't be looking at dropping it now. It was late getting into the gen 2 stack, but now that it is in, storage vendors are ramping up on it. Bret ________________________________ From: openib-promoters-bounces at openib.org [mailto:openib-promoters-bounces at openib.org] On Behalf Of Bill Boas Sent: Tuesday, February 28, 2006 12:49 AM To: openib-promoters at openib.org; openib-general at openib.org Cc: chetm at us.ibm.com Subject: [Openib-promoters] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws There appear to be 2 groups within OpenIB thinking about different approaches to preparing the code for Release 1.0. One group is thinking about downstreaming it to RedHat and Novell, another group seems to be thinking about separate releases from some IB suppliers than others. Let's remind ourselves of the purposes for which OpenIB was created and what all of the member companies have just re-affirmed in the Board meeting last Friday (by approving the re-worked By-laws). The principles are, I believe: (if there are misstatements below, let's discuss them openly) 1) OpenIB develops open source code creating a software stack. OpenIB (now OpenFabrics Alliance) is a corporation with Bylaws that all members should obey if they want the corporation to continue to function. It will only survive if, in general, all members' self-interests are served simultaneously with each member's own self-interest. 2) OpenIB members by a 2/3rds vote of the members have to approve the content of that stack through the Proposal process described in section 12 of the Bylaws.
It is not up to a single member or group of members to decide on their own what is or is not in the OpenIB stack. This is deliberate to prevent one or more members gaining competitive advantage through the OpenIB stack over other members. 3) OpenIB downstreams kernel code to kernel.org 4) OpenIB code is distributed to end customers (like Wall St., labs, etc) and to mid tier customers of OpenIB (Oracle, IBM, Sun, Dell, LNXI etc.) via Linux distributions such as RedHat and Novell. 5) End customers told the IB companies in February 2004 and in December 2005 at Credit Suisse (HSIR meeting) that they wanted ONE OpenIB stack that runs on every IB vendor's hardware, that interoperates with all other IB vendors' h/w and s/w, is used by all mid-tier suppliers and that it comes with their Linux distribution. I realize that so far in OpenIB's evolution we have not worked out the issue of how to support end-customers while following these principles for the release process. But that, I suggest, is not a valid reason for breaking these principles. We should be able to deal with "Release" as one process and "Support" as another process - though of course there will be linkage between them but they are not the same process. The way to do one is not necessarily the way to do the other. This email is an appeal to the two groups to work together, not to work separately, and to work on solving these issues for the membership as a whole, not just their own company, or a select group. Please bring to the Board a proposal that serves all the membership. Here's what one group seems to be thinking (edited to remove "I"): "Here is a first cut at the set of components (protocols, drivers, userspace bits) that we think we should be supporting in 1.0. Please look over it and let us know if we are missing anything.
HCA support (both kernel driver and userspace verbs components): * ehca * ipath * mthca IB protocols: * IPoIB * RC * SDP * SRQ * UC * UD Userland software: * libibverbs * libsdp * opensm As far as we can tell, most of the rest of OpenIB userland (libibcm, libibat, libibmad, etc) is logically part of OpenSM, can be treated as such (I think Doug is already doing this with his Red Hat spec files) and is unlikely to be used by other applications. Am I way off? Components that we don't know what to do about, and will likely want to drop unless someone can vouch for them: * iSER * SRP * uDAPL" Here's what the other group suggested: "Openib Commercial Grade Release 1.0 release criteria 1) CPU Architectures: a) x86_32 (Xeon) b) x86_64 (Nocona, Opteron) c) ia64 d) PPC64 (Power5, Power6) - Mellanox does not support these systems 2) Linux distributors and kernels a) RH: AS EL4 up3; Fedora C4 last update, and maybe FC5 b) SuSE: SuSE 10 last update (open - SLES10 beta) c) kernel.org: the latest that is available when generating rc1. In 1.0 it will probably be 2.6.16 (might be 2.6.17). 3) Packaging and installation a) The openib release will be packaged in one tarball for both kernel and user-level. b) One install script will support full installation. The install will support typical and custom components. I will send a different document with the install definition to be reviewed and agreed between all. 4) HCA and Switch Support: a) HCAs: InfiniHost, InfiniHost III Ex (both modes: with memory and MemFree), InfiniHost III Lx b) Switches: Need to support all vendors' production switches - each vendor should send the list. 5) Switch Management Interoperability testing a) Follow the CIWG-OpenIB HCA-OEM Switch Interop Test Plan 6) Feature set per ULP: a) Will be defined later with each ULP maintainer. 7) Minimum cluster size to be tested a) Need at least a 128-node cluster, bigger is better.
8) Scalability requirements a) SM: i) Bring up a subnet with 1,000 nodes in 2 minutes ii) SM should not be a bottleneck in any application running (IPoIB) b) MPI: i) MPI runner - should be able to launch thousands of processes (say 50,000) in a bounded time manner. ii) Memory consumption - should be able to run many processes on the same node (for now, 8 processes is the upper limit with the Opteron machines), in a many-node (thousands of nodes) installation. iii) Sending HUGE messages in collectives - MPI should not fail for limited physical memory. 9) Performance requirements: First we need to agree on the performance benchmark for each ULP: a) Basic verbs - performance tests in openib (send, RDMA read/write latency & BW) b) IPoIB - netperf c) MPI - Pallas d) SDP - iperf e) SRP - iometer f) iSER - iometer 10) Documentation requirements a) Product brief b) Installation guide c) User guide d) Release notes e) Troubleshooting f) Test Plan and Test Report 11) Storage target test requirements a) Engenio target - Mellanox will be responsible for verification b) Cisco & SST - please add more target systems 12) Firmware and Hardware versions to be tested a) Both DDR and SDR modes should be supported. b) FW burned should be the latest official release by Mellanox: i) InfiniHost III Lx: fw-25204-1.0.800 ii) InfiniHost III Ex: fw-25218-5.1.400 and fw-25208-4.7.600 (both will be released in 2 weeks) iii) InfiniHost: fw-23108-3.4.000 iv) InfiniScale III - fw-47396-0.8.3 v) InfiniScale - fw-43132-5.5.0 13) Specifications compliance: a) Verbs & management: InfiniBand Architecture Specification, Volume 1, Release 1.2 b) IPoIB: www.ietf.org: draft-ietf-ipoib-architecture-04 and draft-ietf-ipoib-ip-over-infiniband-07 c) SDP: Annex A4 of the InfiniBand Architecture Specification, Volume 1, Release 1.2 d) SRP: SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. (www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf ).
e) MPI: www.mpi-forum.org/docs/mpi-11-html/mpi-report.html f) iSER: www.ietf.org/internet-drafts/draft-hufferd-iser-ib-01.pdf g) RDS: SS can you provide info The following two items are very important for the SW stack QA but not gating for starting the release process. 1) ISV test requirements - coverage for all ULPs 2) Database test requirements Cisco, SS and Voltaire should define those, since they already have test beds for commercial applications and databases." From dotanb at mellanox.co.il Tue Feb 28 06:23:03 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 28 Feb 2006 16:23:03 +0200 Subject: [openib-general] [PATCH] mthca: check alt_pkey_index in modify_qp + cosmetic change Message-ID: <44045CC7.3000806@mellanox.co.il> Added a check that the alternate pkey index is valid + cosmetic change. Signed-off-by: Dotan Barak Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c 2006-02-28 15:46:58.000000000 +0200 +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c 2006-02-28 16:02:50.000000000 +0200 @@ -529,8 +529,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, if ((attr_mask & IB_QP_PKEY_INDEX) && attr->pkey_index >= dev->limits.pkey_table_len) { - mthca_dbg(dev, "PKey index (%u) too large. max is %d\n", - attr->pkey_index,dev->limits.pkey_table_len-1); + mthca_dbg(dev, "pkey_index (%u) too large. max is %d\n", + attr->pkey_index, dev->limits.pkey_table_len - 1); return -EINVAL; } @@ -651,6 +651,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } + if (attr->alt_pkey_index >= dev->limits.pkey_table_len) { + mthca_dbg(dev, "alt_pkey_index (%u) too large.
max is %d\n", + attr->alt_pkey_index, + dev->limits.pkey_table_len - 1); + return -EINVAL; + } + mthca_path_set(&attr->alt_ah_attr, &qp_context->alt_path); qp_context->alt_path.port_pkey |= cpu_to_be32(attr->alt_pkey_index | attr->alt_port_num << 24); From mst at mellanox.co.il Tue Feb 28 06:24:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 16:24:06 +0200 Subject: [openib-general] Re: Re: RFC: SDP plans In-Reply-To: References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> Message-ID: <20060228142406.GS19855@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Re: RFC: SDP plans > > > > > - Use sysfs for statistics, entry per socket > > > > this one sounds a bit fishy. with lots of sockets you'll eat up far > > > too much memory I susect. But let's look at the code once it's there. > > > Hmm. Should I stick to /proc then? > > No, nothing like per-socket statistics in /proc is going to be merged > upstream. I think the real question is whether per-socket are useful. > If they are important, and sysfs doesn't work, then another > possibility would be a netlink-based way of retrieving them. I'd like to point out that other protocols maintain stuff in /proc/net/tcp6 Assuming I want to use sysfs, where do I stick my directory? /sys/class/net/sdp? /sys/class/infiniband/sdp? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From jackm at mellanox.co.il Tue Feb 28 06:34:04 2006 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Tue, 28 Feb 2006 16:34:04 +0200 Subject: [openib-general] [PATCH] mad_rmpp: fix check for old ACK Message-ID: <200602281634.04999.jackm@mellanox.co.il> Test for old ACK does not include most recent ACK. 
Signed-off-by: Jack Morgenstein Index: drivers/infiniband/core/mad_rmpp.c =================================================================== --- drivers/infiniband/core/mad_rmpp.c (revision 5525) +++ drivers/infiniband/core/mad_rmpp.c (working copy) @@ -666,7 +666,7 @@ static void process_rmpp_ack(struct ib_m return; } - if (newwin < mad_send_wr->newwin || seg_num < mad_send_wr->last_ack) + if (newwin < mad_send_wr->newwin || seg_num <= mad_send_wr->last_ack) goto out; /* Old ACK */ if (seg_num > mad_send_wr->last_ack) { From jlentini at netapp.com Tue Feb 28 07:10:56 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 28 Feb 2006 10:10:56 -0500 (EST) Subject: [openib-general] RFC: SDP plans In-Reply-To: <000201c63bfa$33fa6af0$0ff9070a@amr.corp.intel.com> References: <000201c63bfa$33fa6af0$0ff9070a@amr.corp.intel.com> Message-ID: On Mon, 27 Feb 2006, Bob Woodruff wrote: > Should we create an RPMs directory in the 1.0 branch where people > can start to put the RPMs for the various kernels and usermode > components ? How about posting the RPMs on OpenIB.org? I'd suggest using the OpenIB Wiki's "Downloads" page: https://openib.org/tiki/tiki-index.php?page=Downloads I don't see a benefit in placing generated binary objects under version control. From rdreier at cisco.com Tue Feb 28 07:31:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Feb 2006 07:31:25 -0800 Subject: [openib-general] Need for ONE OpenIB Release process that all members can agree to and that follows OpenIB Bylaws In-Reply-To: (Bill Boas's message of "Mon, 27 Feb 2006 22:49:05 -0800") References: Message-ID: I think the central issue is that there is a conflation of two very different things: an OpenIB release, and distributions of that OpenIB release. 
In my opinion, the correct consumers to have in mind when thinking of the OpenIB release are the _distributors_: Red Hat, Novell, Debian, Ubuntu or any other vendors who feel that they can provide value by packaging, distributing and supporting the OpenIB release. OpenIB should not think of the release as also being a distribution. OpenIB has no capacity to provide support (no phone lines, no field engineers, etc), and it doesn't make sense to try and build this capacity to compete with commercial vendors who already have it. The standard in the open source world, as exemplified by projects such as the Linux kernel, the Gnome project, KDE, X.org, gcc, etc, etc, is for the open source project to focus on producing a release that distributors can package and get to end users. It is _not_ on producing something that anyone other than the most sophisticated early adopters on the bloodiest bleeding edge will install directly. - R. From rdreier at cisco.com Tue Feb 28 07:32:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Feb 2006 07:32:53 -0800 Subject: [openib-general] Performance optimization In-Reply-To: <20060228071240.GH880@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 27 Feb 2006 23:12:40 -0800") References: <43FA29A1.5020301@uci.edu> <20060227174237.GH31654@esmail.cup.hp.com> <4403B542.8060401@uci.edu> <20060228071240.GH880@esmail.cup.hp.com> Message-ID: > > taskset 0x1 ibv_uc_pingpong xxxxx > Is this test a bandwidth test or a latency test? > Anything that says "pingpong" suggests latency to me. ibv_uc_pingpong is not really a good test of either bandwidth or latency. It is just a basic test of UC functionality. I should probably put some language to that effect in the manpage. - R.
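[Editorial sketch, not part of any message in this archive.] The mad_rmpp patch posted earlier by Jack Morgenstein ("fix check for old ACK") changes the comparison against last_ack from `<` to `<=`, so that an ACK repeating the most recently acknowledged segment is also treated as old. The standalone C below only mirrors that patched condition so it can be tried in user space. The struct and function names are hypothetical stand-ins, and the field types are assumed to be plain ints; the real code operates on the kernel's send work request in drivers/infiniband/core/mad_rmpp.c.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields the patched test reads.
 * The kernel's real send work request structure holds much more state. */
struct rmpp_send_state {
    int newwin;   /* highest window value accepted so far */
    int last_ack; /* highest segment number acknowledged so far */
};

/* Mirrors the condition after the patch:
 *   newwin < mad_send_wr->newwin || seg_num <= mad_send_wr->last_ack
 * Before the patch the second comparison was '<', so an ACK that repeated
 * last_ack was not classified as old. */
static bool rmpp_ack_is_old(const struct rmpp_send_state *s,
                            int seg_num, int newwin)
{
    return newwin < s->newwin || seg_num <= s->last_ack;
}
```

With last_ack = 5 and newwin = 8, an incoming ACK for segment 5 is classified as old under this condition, while an ACK for segment 6 with a window of 8 or more is not.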
From jlentini at netapp.com Tue Feb 28 07:37:16 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 28 Feb 2006 10:37:16 -0500 (EST) Subject: [openib-general] Re: [PATCH 1.0] uDAPL - QP destroy and HCA close problems fixed In-Reply-To: References: Message-ID: On Mon, 27 Feb 2006, Arlin Davis wrote: > Here is a small uDAPL patch that should go into 1.0 that fixes some > issues that we just found with MPI scale out testing on OpenIB. QP > was not being destroyed in some cases and hca_close issues with > async work thread. I am still working one other elusive disconnect > problem that may require another small patch. Committed in the trunk and on the 1.0 branch in revision 5530. From dotanb at mellanox.co.il Tue Feb 28 07:43:29 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 28 Feb 2006 17:43:29 +0200 Subject: [openib-general] Performance optimization Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3018A4F2D@mtlexch01.mtl.com> > > Is this test a bandwidth test or a latency test? > > Anything that says "pingpong" suggests latency to me. > > ibv_uc_pingpong is not really a good test of either bandwidth or > latency. It is just a basic test of UC functionality. If you need, you can find in src/userspace/perftest several tests for bw and latency for RC / UD / UC QPs for RDMA Write / RDMA Read / Send. Dotan From mst at mellanox.co.il Tue Feb 28 07:50:03 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 17:50:03 +0200 Subject: [openib-general] [PATCH repost] ipoib_multicast_race.patch Message-ID: <20060228155003.GT19855@mellanox.co.il> Hello, Roland! The following was found by code review. Please review. I have updated the description, so I expect the issue and the fix are clear now. --- ipoib_mcast_stop_thread currently tests mcast->query and if it is NULL, does not perform wait_for_completion on the mcast and frees the mcast object directly. 
However, since both operations are done without locking, it is possible that ipoib_mcast_join_complete is in progress on this mcast object and has set mcast->query to NULL already. Solve this by: - taking priv->lock before we change mcast->query in ipoib_mcast_join_complete, and keeping it until we no longer need the mcast object - taking priv->lock around mcast->query test in ipoib_mcast_stop_thread Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-15 17:02:52.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2006-01-15 17:05:16.000000000 +0200 @@ -432,9 +432,11 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + mutex_lock(&mcast_mutex); + + spin_lock_irq(&priv->lock); mcast->query = NULL; - mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) queue_work(ipoib_workqueue, &priv->mcast_task); @@ -443,6 +445,7 @@ static void ipoib_mcast_join_complete(in mcast->backoff * HZ); } else complete(&mcast->done); + spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); return; @@ -627,21 +630,27 @@ int ipoib_mcast_stop_thread(struct net_d if (flush) flush_workqueue(ipoib_workqueue); + spin_lock_irq(&priv->lock); if (priv->broadcast && priv->broadcast->query) { ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); priv->broadcast->query = NULL; + spin_unlock_irq(&priv->lock); ipoib_dbg_mcast(priv, "waiting for bcast\n"); wait_for_completion(&priv->broadcast->done); - } + } else + spin_unlock_irq(&priv->lock); list_for_each_entry(mcast, &priv->multicast_list, list) { + spin_lock_irq(&priv->lock); if (mcast->query) { ib_sa_cancel_query(mcast->query_id, mcast->query); mcast->query = NULL; + 
spin_unlock_irq(&priv->lock); ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); wait_for_completion(&mcast->done); - } + } else + spin_unlock_irq(&priv->lock); } return 0; -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 07:56:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 17:56:05 +0200 Subject: [openib-general] Re: Need for ONE OpenIB Release process that all members can agree to and that follows OpenIB Bylaws In-Reply-To: References: Message-ID: <20060228155605.GU19855@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: Need for ONE OpenIB Release process that all members can agree to and that follows OpenIB Bylaws > > I think the central issue is that there is a conflation of two very > different things: an OpenIB release, and distributions of that OpenIB > release. In my opinion, the correct consumers to have in my when > thinking of the OpenIB release are the _distributors_: Red Hat, > Novell, Debian, Ubuntu or any other vendors who feel that they can > provide value by packaging, distributing and supporting the OpenIB release. I second this, although since most stuff is just built with autotools, these releases are actually sometimes useful for a power-user. In particular, with this approach openib is distribution-agnostic which is a very good thing. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From devesh28 at gmail.com Tue Feb 28 08:12:59 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Tue, 28 Feb 2006 21:42:59 +0530 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <1141128535.4335.4148.camel@hal.voltaire.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> <1141128535.4335.4148.camel@hal.voltaire.com> Message-ID: <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> Hi Hal Thanks for replying. 
This satisfies my needs if the user can define his own GUID-to-LID mapping. Can the user define his own GUID-to-LID mapping in this file? Devesh On 28 Feb 2006 07:19:35 -0500, Hal Rosenstock wrote: > > Hi Devesh, > > On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: > > Hi list, > > Please anybody brife me about the LID assignment policy used by opensm > > subnet manager. Can user specify fixed LID mappings using a file? > > There is a file it creates with these in it so they can be reused > subsequently. It is /var/cache/osm/guid2lid. > > opensm -h has the following option: > -c > --cache-options > Cache the given command line options into the file > /var/cache/osm/opensm.opts for use next invocation > The cache directory can be changed by the environment > variable OSM_CACHE_DIR > > Is that suitable for your needs ? > > -- Hal From mohitka at hcl.in Tue Feb 28 08:11:36 2006 From: mohitka at hcl.in (Mohit Katiyar, Noida) Date: Tue, 28 Feb 2006 21:41:36 +0530 Subject: [openib-general] function details Message-ID: <3E6BB9CEE261E2428AD25D0D553DC497031F555F@HSDLNTD1110010.noida.hcltech.com> Hi, I am new to the iSER code and I was looking at the isci_iser_pool_init function. What I could infer from it is that it initializes an element of the isci_iser_queue structure (allocating memory) for the iscsi command task structure, taking the max value. Am I correct in my understanding? If anyone can provide a few details about it, that will be very helpful.
Thanks and Regards Mohit Katiyar HCL Technologies Ph: 9891988857 From mst at mellanox.co.il Tue Feb 28 08:48:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 18:48:50 +0200 Subject: [openib-general] [PATCH repost] remove ip_dev_find from addr.c Message-ID: <20060228164850.GX19855@mellanox.co.il> Sean, this seems to work for me. Please test. --- Remove dependency of addr.c on ip_dev_find so that we can work on vanilla 2.6.15. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.15/drivers/infiniband/core/addr.c =================================================================== --- linux-2.6.15.orig/drivers/infiniband/core/addr.c 2006-02-28 17:24:56.000000000 +0200 +++ linux-2.6.15/drivers/infiniband/core/addr.c 2006-02-28 18:14:12.000000000 +0200 @@ -46,6 +46,33 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("IB Address Translation"); MODULE_LICENSE("Dual BSD/GPL"); +static struct net_device *xxx_ip_dev_find(u32 addr) +{ + struct net_device *dev; + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) + if ((in_dev = in_dev_get(dev))) { + for (ifap = &in_dev->ifa_list; (ifa = *ifap); + ifap = &ifa->ifa_next) { + if (addr == ifa->ifa_address) { + dev_hold(dev); + in_dev_put(in_dev); + goto found; + } + } + in_dev_put(in_dev); + } +found: + read_unlock(&dev_base_lock); + return dev; +} + +#define ip_dev_find xxx_ip_dev_find + struct addr_req { struct list_head list; struct sockaddr src_addr; -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 08:51:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 18:51:50 +0200 Subject: [openib-general] cmatose questions Message-ID: <20060228165150.GY19855@mellanox.co.il> Hello, Sean! I am trying out cmatose test. On the server side, I have loaded the server module. 
I have loaded the client the first time and got in log

cmatose: starting client
cmatose: connecting
cmatose: event: 8, error: 8
cmatose: connect time: 4000 us
cmatose: waiting to disconnect
cmatose: test complete

I then unloaded the client and loaded it again:

modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose

Now I see

cmatose: starting client
cmatose: connecting

and the process can't be killed.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From rdreier at cisco.com Tue Feb 28 08:57:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 28 Feb 2006 08:57:59 -0800
Subject: [openib-general] Re: Need for ONE OpenIB Release process that all members can agree to and that follows OpenIB Bylaws
In-Reply-To: <20060228155605.GU19855@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 28 Feb 2006 17:56:05 +0200")
References: <20060228155605.GU19855@mellanox.co.il>
Message-ID:

    Michael> I second this, although since most stuff is just built
    Michael> with autotools, these releases are actually sometimes
    Michael> useful for a power-user.

I agree (as I said in my original email, the release will be used by "the most sophisticated early adopters on the bloodiest bleeding edge" ;). And in fact it might be interesting for someone to figure out how to use JHBuild (http://www.gnome.org/~jamesh/jhbuild.html) or the like to make bleeding-edge installs easier for power users.

 - R.

From robert.j.woodruff at intel.com Tue Feb 28 09:11:14 2006
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Tue, 28 Feb 2006 09:11:14 -0800
Subject: [openib-general] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws
In-Reply-To:
Message-ID: <000001c63c89$fb5251a0$6aa1070a@amr.corp.intel.com>

Bill Boas wrote,
>Components that we don't know what to do about, and will likely want to drop
>unless someone can vouch for them:
> * iSER
> * SRP
> * uDAPL"

Why would you drop these ? People want all of them.
woody

From jlentini at netapp.com Tue Feb 28 09:19:14 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 28 Feb 2006 12:19:14 -0500 (EST)
Subject: [openib-general] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws
In-Reply-To: <000001c63c89$fb5251a0$6aa1070a@amr.corp.intel.com>
References: <000001c63c89$fb5251a0$6aa1070a@amr.corp.intel.com>
Message-ID:

On Tue, 28 Feb 2006, Bob Woodruff wrote:

> >Components that we don't know what to do about, and will likely want to drop
> >unless someone can vouch for them:
> > * iSER
> > * SRP
> > * uDAPL"
>
> Why would you drop these ? People want all of them.

Bill wasn't proposing to drop these. He was quoting Bryan O'Sullivan's original post:

http://openib.org/pipermail/openib-general/2006-February/017019.html

With regards to uDAPL, I think we have established that it should be part of the 1.0 release.

From jackm at mellanox.co.il Tue Feb 28 09:24:33 2006
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Tue, 28 Feb 2006 19:24:33 +0200
Subject: [openib-general] [PATCH] mad: add GID/class checking for matching received to sent MADs
Message-ID: <200602281924.33041.jackm@mellanox.co.il>

Adds GID and class checking to mad receive processing when locating sent MAD.
Signed-off-by: Jack Morgenstein Index: latest/drivers/infiniband/core/mad.c =================================================================== --- latest.orig/drivers/infiniband/core/mad.c +++ latest/drivers/infiniband/core/mad.c @@ -1641,14 +1641,59 @@ static int is_data_mad(struct ib_mad_age (rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_DATA); } +static inline int rcv_has_same_class(struct ib_mad_send_wr_private *wr, + struct ib_mad_recv_wc *rwc) +{ + return ((struct ib_mad *)(wr->send_buf.mad))->mad_hdr.mgmt_class == + rwc->recv_buf.mad->mad_hdr.mgmt_class; +} + +static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, + struct ib_mad_recv_wc *rwc ) +{ + struct ib_ah_attr attr; + u8 send_resp, rcv_resp; + + send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> + mad_hdr.method & IB_MGMT_METHOD_RESP; + rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; + + if (!send_resp && rcv_resp) + /* is request/response. GID/LIDs are both local (same). */ + return 1; + + if (send_resp == rcv_resp) + /* both requests, or both responses. GIDs different */ + return 0; + + if (ib_query_ah(wr->send_buf.ah, &attr)) + /* Assume not equal, to avoid false positives. */ + return 0; + + if (!(attr.ah_flags & IB_AH_GRH) && !(rwc->wc->wc_flags & IB_WC_GRH)) + return attr.dlid == rwc->wc->slid; + else if ((attr.ah_flags & IB_AH_GRH) && + (rwc->wc->wc_flags & IB_WC_GRH)) + return memcmp(attr.grh.dgid.raw, + rwc->recv_buf.grh->sgid.raw, 16) == 0; + else + /* one has GID, other does not. 
Assume different */ + return 0; +} struct ib_mad_send_wr_private* -ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, __be64 tid) +ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_recv_wc *mad_recv_wc) { struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad *mad; + + mad = (struct ib_mad *)mad_recv_wc->recv_buf.mad; list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, agent_list) { - if (mad_send_wr->tid == tid) + if ((mad_send_wr->tid == mad->mad_hdr.tid) && + rcv_has_same_class(mad_send_wr, mad_recv_wc) && + rcv_has_same_gid(mad_send_wr, mad_recv_wc)) return mad_send_wr; } @@ -1659,7 +1704,10 @@ ib_find_send_mad(struct ib_mad_agent_pri list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_list) { if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && - mad_send_wr->tid == tid && mad_send_wr->timeout) { + mad_send_wr->tid == mad->mad_hdr.tid && + mad_send_wr->timeout && + rcv_has_same_class(mad_send_wr, mad_recv_wc) && + rcv_has_same_gid(mad_send_wr, mad_recv_wc)) { /* Verify request has not been canceled */ return (mad_send_wr->status == IB_WC_SUCCESS) ? 
mad_send_wr : NULL; @@ -1702,7 +1750,7 @@ static void ib_mad_complete_recv(struct if (response_mad(mad_recv_wc->recv_buf.mad)) { tid = mad_recv_wc->recv_buf.mad->mad_hdr.tid; spin_lock_irqsave(&mad_agent_priv->lock, flags); - mad_send_wr = ib_find_send_mad(mad_agent_priv, tid); + mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc); if (!mad_send_wr) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); ib_free_recv_mad(mad_recv_wc); Index: latest/drivers/infiniband/core/mad_priv.h =================================================================== --- latest.orig/drivers/infiniband/core/mad_priv.h +++ latest/drivers/infiniband/core/mad_priv.h @@ -214,7 +214,8 @@ extern kmem_cache_t *ib_mad_cache; int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr); struct ib_mad_send_wr_private * -ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, __be64 tid); +ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_recv_wc *mad_recv_wc); void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); Index: latest/drivers/infiniband/core/mad_rmpp.c =================================================================== --- latest.orig/drivers/infiniband/core/mad_rmpp.c +++ latest/drivers/infiniband/core/mad_rmpp.c @@ -575,15 +575,15 @@ int send_next_seg(struct ib_mad_send_wr_ return ib_send_mad(mad_send_wr); } -static void abort_send(struct ib_mad_agent_private *agent, __be64 tid, - u8 rmpp_status) +static void abort_send(struct ib_mad_agent_private *agent, + struct ib_mad_recv_wc *mad_recv_wc, u8 rmpp_status) { struct ib_mad_send_wr_private *mad_send_wr; struct ib_mad_send_wc wc; unsigned long flags; spin_lock_irqsave(&agent->lock, flags); - mad_send_wr = ib_find_send_mad(agent, tid); + mad_send_wr = ib_find_send_mad(agent, mad_recv_wc); if (!mad_send_wr) goto out; /* Unmatched send */ @@ -635,8 +635,7 @@ static void process_rmpp_ack(struct ib_m rmpp_mad = (struct ib_rmpp_mad 
*)mad_recv_wc->recv_buf.mad; if (rmpp_mad->rmpp_hdr.rmpp_status) { - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_BAD_STATUS); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); return; } @@ -644,14 +643,13 @@ static void process_rmpp_ack(struct ib_m seg_num = be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num); newwin = be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); if (newwin < seg_num) { - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_W2S); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_W2S); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_W2S); return; } spin_lock_irqsave(&agent->lock, flags); - mad_send_wr = ib_find_send_mad(agent, rmpp_mad->mad_hdr.tid); + mad_send_wr = ib_find_send_mad(agent, mad_recv_wc); if (!mad_send_wr) goto out; /* Unmatched ACK */ @@ -661,8 +659,7 @@ static void process_rmpp_ack(struct ib_m if (seg_num > mad_send_wr->total_seg || seg_num > mad_send_wr->newwin) { spin_unlock_irqrestore(&agent->lock, flags); - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_S2B); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_S2B); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_S2B); return; } @@ -751,12 +748,10 @@ static void process_rmpp_stop(struct ib_ rmpp_mad = (struct ib_rmpp_mad *)mad_recv_wc->recv_buf.mad; if (rmpp_mad->rmpp_hdr.rmpp_status != IB_MGMT_RMPP_STATUS_RESX) { - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_BAD_STATUS); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); } else - abort_send(agent, rmpp_mad->mad_hdr.tid, - rmpp_mad->rmpp_hdr.rmpp_status); + abort_send(agent, mad_recv_wc, rmpp_mad->rmpp_hdr.rmpp_status); } static void process_rmpp_abort(struct ib_mad_agent_private *agent, @@ -768,12 +763,10 @@ static void process_rmpp_abort(struct ib if (rmpp_mad->rmpp_hdr.rmpp_status < 
IB_MGMT_RMPP_STATUS_ABORT_MIN || rmpp_mad->rmpp_hdr.rmpp_status > IB_MGMT_RMPP_STATUS_ABORT_MAX) { - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_BAD_STATUS); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BAD_STATUS); } else - abort_send(agent, rmpp_mad->mad_hdr.tid, - rmpp_mad->rmpp_hdr.rmpp_status); + abort_send(agent, mad_recv_wc, rmpp_mad->rmpp_hdr.rmpp_status); } struct ib_mad_recv_wc * @@ -787,8 +780,7 @@ ib_process_rmpp_recv_wc(struct ib_mad_ag return mad_recv_wc; if (rmpp_mad->rmpp_hdr.rmpp_version != IB_MGMT_RMPP_VERSION) { - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_UNV); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_UNV); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_UNV); goto out; } @@ -806,8 +798,7 @@ ib_process_rmpp_recv_wc(struct ib_mad_ag process_rmpp_abort(agent, mad_recv_wc); break; default: - abort_send(agent, rmpp_mad->mad_hdr.tid, - IB_MGMT_RMPP_STATUS_BADT); + abort_send(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BADT); nack_recv(agent, mad_recv_wc, IB_MGMT_RMPP_STATUS_BADT); break; } From rdreier at cisco.com Tue Feb 28 09:26:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Feb 2006 09:26:23 -0800 Subject: [openib-general] Re: Re: RFC: SDP plans In-Reply-To: <20060228142406.GS19855@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 28 Feb 2006 16:24:06 +0200") References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> <20060228142406.GS19855@mellanox.co.il> Message-ID: Michael> I'd like to point out that other protocols maintain stuff Michael> in /proc/net/tcp6 Assuming I want to use sysfs, where do Michael> I stick my directory? /sys/class/net/sdp? Michael> /sys/class/infiniband/sdp? Hmm, I wasn't aware of /proc/net/tcp et al. But it is clearly the sort of legacy stuff (unfortunately used by netstat) that we don't want to add more of. 
I don't think sysfs is really the right place for it either, since the one-value-per-file rule is going to make things outrageously complicated.

netlink (maybe using the netlink connector, depending on whether you think it makes things better or worse) seems like the best solution to me.

debugfs is also a possibility, but you may want the socket information available without debugfs being mounted.

 - R.

From mshefty at ichips.intel.com Tue Feb 28 09:39:08 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 28 Feb 2006 09:39:08 -0800
Subject: [openib-general] [PATCH] make cmatose compile
In-Reply-To: <20060228134610.GP19855@mellanox.co.il>
References: <20060228134610.GP19855@mellanox.co.il>
Message-ID: <44048ABC.7040608@ichips.intel.com>

Michael S. Tsirkin wrote:
> Make cmatose compile under 2.6.15

Thanks - committed in 5532.

- Sean

From mshefty at ichips.intel.com Tue Feb 28 09:41:26 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 28 Feb 2006 09:41:26 -0800
Subject: [openib-general] [PATCH] mad_rmpp: fix check for old ACK
In-Reply-To: <200602281634.04999.jackm@mellanox.co.il>
References: <200602281634.04999.jackm@mellanox.co.il>
Message-ID: <44048B46.5010208@ichips.intel.com>

Jack Morgenstein wrote:
> - if (newwin < mad_send_wr->newwin || seg_num < mad_send_wr->last_ack)
> + if (newwin < mad_send_wr->newwin || seg_num <= mad_send_wr->last_ack)
> goto out; /* Old ACK */

This would drop ACKs that are only updating the new window.

- Sean

From mshefty at ichips.intel.com Tue Feb 28 09:45:58 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 28 Feb 2006 09:45:58 -0800
Subject: [openib-general] cmatose questions
In-Reply-To: <20060228165150.GY19855@mellanox.co.il>
References: <20060228165150.GY19855@mellanox.co.il>
Message-ID: <44048C56.3000708@ichips.intel.com>

Michael S. Tsirkin wrote:
> I then unloaded the client and loaded it again:
> modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose

The kernel dst_ip isn't this smart.
You'll need to provide the IP address as:

x9b08040b

Note that there's a userspace version of cmatose in librdmacm that's easier to interface to, and can handle normal IP addresses and host names as input.

- Sean

From bboas at systemfabricworks.com Tue Feb 28 09:46:53 2006
From: bboas at systemfabricworks.com (Bill Boas)
Date: Tue, 28 Feb 2006 09:46:53 -0800
Subject: [openib-general] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws
In-Reply-To:
Message-ID:

Thank you, James, that's right.

-----Original Message-----
From: James Lentini [mailto:jlentini at netapp.com]
Sent: Tuesday, February 28, 2006 9:19 AM
To: Bob Woodruff
Cc: 'Bill Boas'; openib-promoters at openib.org; openib-general at openib.org
Subject: RE: [openib-general] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws

On Tue, 28 Feb 2006, Bob Woodruff wrote:

> >Components that we don't know what to do about, and will likely want to drop
> >unless someone can vouch for them:
> > * iSER
> > * SRP
> > * uDAPL"
>
> Why would you drop these ? People want all of them.

Bill wasn't proposing to drop these. He was quoting Bryan O'Sullivan's original post:

http://openib.org/pipermail/openib-general/2006-February/017019.html

With regards to uDAPL, I think we have established that it should be part of the 1.0 release.

From rick.cecil at intel.com Tue Feb 28 09:45:54 2006
From: rick.cecil at intel.com (Cecil, Rick)
Date: Tue, 28 Feb 2006 09:45:54 -0800
Subject: [openib-general] RE: [Openib-promoters] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws
Message-ID: <8859A62A93B99741A47AC47ECF8C35750BE159B3@fmsmsx407.amr.corp.intel.com>

We need:
 * iSER
 * SRP
 * uDAPL
to build system solutions.
Rick

________________________________
From: openib-promoters-bounces at openib.org [mailto:openib-promoters-bounces at openib.org] On Behalf Of Bill Boas
Sent: Monday, February 27, 2006 10:49 PM
To: openib-promoters at openib.org; openib-general at openib.org
Cc: chetm at us.ibm.com
Subject: [Openib-promoters] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws

There appear to be 2 groups within OpenIB thinking about different approaches to preparing the code for Release 1.0. One group is thinking about downstreaming it to RedHat and Novell, another group seems to be thinking about separate releases from some IB suppliers than others.

Let's remind ourselves of the purposes for which OpenIB was created and what all of the member companies have just re-affirmed in the Board meeting last Friday (by approving the re-worked By-laws).

The principles are, I believe: (if there are misstatements below, let's discuss openly)

1) OpenIB develops open source code creating a software stack. OpenIB (now OpenFabrics Alliance) is a corporation with Bylaws that all members should obey if they want the corporation to continue to function. It will only survive if, in general, all members' self-interests are served simultaneously with each member's own self-interest.

2) OpenIB members by a 2/3rds vote of the members have to approve the content of that stack through the Proposal process described in section 12 of the Bylaws. It is not up to a single member or group of members to decide on their own what is or is not in the OpenIB stack. This is deliberate to prevent one or more members gaining competitive advantage through the OpenIB stack over other members.

3) OpenIB downstreams kernel code to kernel.org

4) OpenIB code is distributed to end customers (like Wall St., labs, etc) and to mid tier customers of OpenIB (Oracle, IBM, Sun, Dell, LNXI etc.) via Linux distributions such as RedHat and Novell.
5) End customers told the IB companies in February 2004 and in December 2005 at Credit Suisse (HSIR meeting) that they wanted ONE OpenIB stack that runs on every IB vendor's hardware, that interoperates with all other IB vendors' h/w and s/w, is used by all mid-tier suppliers and that it comes with their Linux distribution.

I realize that so far in OpenIB's evolution we have not worked out the issue of how to support end-customers while following these principles for the release process. But that, I suggest, is not a valid reason for breaking these principles. We should be able to deal with "Release" as one process and "Support" as another process - though of course there will be linkage between them but they are not the same process. The way to do one is not necessarily the way to do the other.

This email is an appeal to the two groups to work together, not to work separately, and to work on solving these issues for the membership as a whole, not just their own company, or a select group. Please bring to the Board a proposal that serves all the membership.

Here's what one group seems to be thinking (edited to remove "I"):

"Here is a first cut at the set of components (protocols, drivers, userspace bits) that we think we should be supporting in 1.0. Please look over it and let us know if we are missing anything.

HCA support (both kernel driver and userspace verbs components):
 * ehca
 * ipath
 * mthca

IB protocols:
 * IPoIB
 * RC
 * SDP
 * SRQ
 * UC
 * UD

Userland software:
 * libibverbs
 * libsdp
 * opensm

As far as we can tell, most of the rest of OpenIB userland (libibcm, libibat, libibmad, etc) is logically part of OpenSM, can be treated as such (I think Doug is already doing this with his Red Hat spec files) and is unlikely to be used by other applications. Am I way off?
Components that we don't know what to do about, and will likely want to drop unless someone can vouch for them:
 * iSER
 * SRP
 * uDAPL"

Here's what the other group suggested:

"Openib Commercial Grade Release 1.0 release criteria

1) CPU Architectures:
   a) x86_32 (Xeon)
   b) x86_64 (Nocona, Opteron)
   c) ia64
   d) PPC64 (Power5, Power6) - Mellanox does not support these systems

2) Linux distributors and kernels
   a) RH: AS EL4 up3; Fedora C4 last update, and maybe FC5
   b) SuSE: SuSE 10 last update (open - SLES10 beta)
   c) kernel.org: the latest that is available when generating rc1. In 1.0 it will probably be 2.6.16 (might be 2.6.17).

3) Packaging and installation
   a) The openib release will be packaged in one tarball for both kernel and user-level.
   b) One install script will support full installation. The install will support typical and custom components. I will send a different document with install definition to be reviewed and agreed between all.

4) HCA and Switch Support:
   a) HCAs: InfiniHost, InfiniHost III Ex (both modes: with memory and MemFree), InfiniHost III Lx
   b) Switches: Need to support all vendors' production switches - each vendor should send the list.

5) Switch Management Interoperability testing
   a) Follow the CIWG-OpenIB HCA-OEM Switch Interop Test Plan

6) Feature set per ULP:
   a) Will be defined later with each ULP maintainer.

7) Minimum cluster size to be tested
   a) Need at least 128 nodes cluster, bigger is better.

8) Scalability requirements
   a) SM:
      i) Bringup a subnet with 1,000 nodes in 2 minutes
      ii) SM should not be a bottleneck in any application running (IPoIB)
   b) MPI:
      i) MPI runner - should be able to launch thousands of processes (say 50,000) in a bounded time manner.
      ii) Memory consumption - should be able to run many processes on the same node (for now, 8 processes is the upper limit with the Opteron machines), in a many node (thousands of nodes) installation.
      iii) Sending HUGE messages in collectives - MPI should not fail for limited physical memory.
9) Performance requirements: First we need to agree on the performance benchmark for each ULP:
   a) Basic verbs - performance tests in openib (send, RDMA read/write latency & BW)
   b) IPoIB - netperf
   c) MPI - Pallas
   d) SDP - iperf
   e) SRP - iometer
   f) iSER - iometer

10) Documentation requirements
   a) Product brief
   b) Installation guide
   c) User guide
   d) Release notes
   e) Troubleshooting
   f) Test Plan and Test Report

11) Storage target test requirements
   a) Engenio target - Mellanox will be responsible for verification
   b) Cisco & SST - please add more target systems

12) Firmware and Hardware versions to be tested
   a) Both DDR and SDR modes should be supported.
   b) FW burned should be the last official released by Mellanox:
      i) InfiniHost III Lx: fw-25204-1.0.800
      ii) InfiniHost III Ex: fw-25218-5.1.400 and fw-25208-4.7.600 (both will be released in 2 weeks)
      iii) InfiniHost: fw-23108-3.4.000
      iv) InfiniScale III - fw-47396-0.8.3
      v) InfiniScale - fw-43132-5.5.0

13) Specifications compliance:
   a) Verbs & management: InfiniBand Architecture Specification, Volume 1, Release 1.2
   b) IPoIB: www.ietf.org: draft-ietf-ipoib-architecture-04 and draft-ietf-ipoib-ip-over-infiniband-07
   c) SDP: Annex A4 of the InfiniBand Architecture Specification, Volume 1, Release 1.2
   d) SRP: SCSI RDMA Protocol-2 (SRP-2), Doc. no. T10/1524-D. (www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf ).
   e) MPI: www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
   f) iSER: www.ietf.org/internet-drafts/draft-hufferd-iser-ib-01.pdf
   g) RDS: SS can you provide info

The following two items are very important for the SW stack QA but not gating for starting the release process.

1) ISV test requirements - coverage for all ULPs
2) Database test requirements

Cisco, SS and Voltaire should define those, since they already have test beds for commercial applications and databases."

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From mshefty at ichips.intel.com Tue Feb 28 09:59:15 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 28 Feb 2006 09:59:15 -0800
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <20060228130819.GJ19855@mellanox.co.il>
References: <200602231825.19333.jackm@mellanox.co.il> <43FDF106.8070403@ichips.intel.com> <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> <20060228130819.GJ19855@mellanox.co.il>
Message-ID: <44048F73.9040108@ichips.intel.com>

Michael S. Tsirkin wrote:
> Sorry for being dense, I'm not sure I understand. Do you mean response when you
> say receive? We never have both a request and a response outstanding to the
> same remote GID with the same TID, do we?

I was meaning:

Request - response bit is 0
Response - response bit is 1
Send - outgoing MAD (may be request or response)
Receive - incoming MAD (may be request or response)

In your example, you had host A sending a request to B (I'll call transaction 1), and host B sending a request to A (transaction 2). Host A has one request being sent, and another being received. An ACK from B to A must match with transaction 1. A NACK from B to A can match with either transaction.

To match NACKs using the response bit implies that requests can only flow one direction, and responses the other. This adds a policy that I don't think should be part of the general MAD layer code.
- Sean

From bos at pathscale.com Tue Feb 28 10:00:31 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Tue, 28 Feb 2006 10:00:31 -0800
Subject: [openib-general] Re: RFC: SDP plans
In-Reply-To: <20060228073209.GC17543@mellanox.co.il>
References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il>
Message-ID: <1141149631.24103.21.camel@camp4.serpentine.com>

On Tue, 2006-02-28 at 09:32 +0200, Michael S. Tsirkin wrote:

> So you either have a modified libsdp.conf (could you post it please?)
> or there's a bug in libsdp.

I'll have Ralph, who originally ran into these problems, follow up to see if he can still reproduce the problem.

References: <000201c63bfa$33fa6af0$0ff9070a@amr.corp.intel.com>
Message-ID: <1141149675.24103.23.camel@camp4.serpentine.com>

On Tue, 2006-02-28 at 10:10 -0500, James Lentini wrote:

> I don't see a benefit in placing generated binary objects under
> version control.

I'm inclined to agree. If nothing else, doing a "svn co" is already achingly slow, and I'd rather not make it even worse.

References: <44034F36.2000402@ichips.intel.com> <20060227193637.GB20064@mellanox.co.il> <200602281112.41981.jackm@mellanox.co.il>
Message-ID: <44049079.1050107@ichips.intel.com>

Jack Morgenstein wrote:
> Which session actually receives the ACK is a toss-up, since when a session
> sends a segment, it is placed (upon send completion) at the tail of the wait
> queue. Thus, arriving ACKs will likely be routed (based upon TID only) to
> one of the duplicate sessions.

Okay - this makes more sense. Thanks for the clarification.
- Sean

From bos at pathscale.com Tue Feb 28 10:04:26 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Tue, 28 Feb 2006 10:04:26 -0800
Subject: [openib-general] Re: Re: RFC: SDP plans
In-Reply-To:
References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> <20060228142406.GS19855@mellanox.co.il>
Message-ID: <1141149866.24103.27.camel@camp4.serpentine.com>

On Tue, 2006-02-28 at 09:26 -0800, Roland Dreier wrote:

> netlink (maybe using the netlink connector, depending on whether you
> think it makes things better or worse) seems like the best solution to
> me.

Just be aware that if you use connector, maintainers of 2.6.9 backports aren't going to be happy; you'll also be about the first user of connector, so you may have some bug-hunting to do.

On the other hand, the plain old netlink API has changed somewhat between 2.6.9 and 2.6.15, so keep that in mind, too.

Message-ID: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com>

Bill Boas wrote,
>There appear to be 2 groups within OpenIB thinking about different approaches to
>preparing the code for Release 1.0. One group is thinking about downstreaming it
>to RedHat and Novell, another group seems to be thinking about separate releases
>from some IB suppliers than others.

I think that it is pretty important that there be a way for customers to know exactly what version of openib code they are getting, so having a versioning (release) mechanism from openib is definitely needed and is what has started with the release 1.0 branch that was started last week.

Also, let's not confuse release versioning with distribution. Projects like gcc do release versions, but end users typically get the code from the Linux distributors.

OpenIB doing release versions will also make it easier for distros to pick up the code for distribution and be able to tell people what version it is, just like I can tell what version of gcc is on a particular RedHat or SUSE CD.
my 2 cents,

woody

From mst at mellanox.co.il Tue Feb 28 10:12:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 28 Feb 2006 20:12:10 +0200
Subject: [openib-general] Re: Re: RFC: SDP plans
In-Reply-To:
References: <20060227173459.GB19855@mellanox.co.il> <20060227212002.GA7597@lst.de> <20060228003833.GD20064@mellanox.co.il> <20060228142406.GS19855@mellanox.co.il>
Message-ID: <20060228181209.GA21556@mellanox.co.il>

Quoting r. Roland Dreier :
> I don't think sysfs is really the right place for it either, since the
> one-value-per-file rule is going to make things outrageously
> complicated.

Not complicated really, but it might be expensive. On the other hand, a simple test shows I can create 64K directories in sysfs without trouble.

> netlink (maybe using the netlink connector, depending on whether you
> think it makes things better or worse) seems like the best solution to
> me.

But this means a library will be required to get at this info, right?

> debugfs is also a possibility, but you may want the socket
> information available without debugfs being mounted.

Yes, I will.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Tue Feb 28 10:13:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 28 Feb 2006 20:13:00 +0200
Subject: [openib-general] cmatose questions
In-Reply-To: <44048C56.3000708@ichips.intel.com>
References: <20060228165150.GY19855@mellanox.co.il> <44048C56.3000708@ichips.intel.com>
Message-ID: <20060228181300.GB21556@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [openib-general] cmatose questions
>
> Michael S. Tsirkin wrote:
> >I then unloaded the client and loaded it again:
> >modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose
>
> The kernel dst_ip isn't this smart.
> You'll need to provide the IP address as:
>
> x9b08040b
>
> Note that there's a userspace version of cmatose in librdmacm that's easier
> to interface to, and can handle normal IP addresses and host names as input.

Right, but I'm really mostly interested in userspace version.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Tue Feb 28 10:20:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 28 Feb 2006 20:20:38 +0200
Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
In-Reply-To: <44048F73.9040108@ichips.intel.com>
References: <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> <20060228130819.GJ19855@mellanox.co.il> <44048F73.9040108@ichips.intel.com>
Message-ID: <20060228182038.GC21556@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID
>
> Michael S. Tsirkin wrote:
> >Sorry for being dense, I'm not sure I understand. Do you mean response
> >when you say receive? We never have both a request and a response
> >outstanding to the same remote GID with the same TID, do we?
>
> I was meaning:
>
> Request - response bit is 0
> Response - response bit is 1
> Send - outgoing MAD (may be request or response)
> Receive - incoming MAD (may be request or response)
>
> In your example, you had host A sending a request to B (I'll call
> transaction 1), and host B sending a request to A (transaction 2). Host A
> has one request being sent, and another being received. An ACK from B to A
> must match with transaction 1. A NACK from B to A can match with either
> transaction.

But the spec says:
The method used for all packets sent from the Receiver to the Sender shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending on which initiated the transfer. . The method used for all packets sent from the Sender to the Receiver shall be SubnAdmGetTableResp(). I interpreset this as: A NACK from B to A will have a response bit if its a NACK for MAD that A is sending (NACK for Receive MAD) and it will have a response bit cleared if its a NACK for MAD that B is sending (NACK for send MAD). > To match NACKs using the response bit implies that requests can only flow > one direction, and responses the other. This adds a policy that I don't > think should be part of the general MAD layer code. But I think the spec implies that ACK/NACK response bit is calculated by looking at whether you are Sender or Receiver, not by whether you are requester or responder. Am I wrong? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 10:23:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 20:23:27 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <44049079.1050107@ichips.intel.com> References: <44034F36.2000402@ichips.intel.com> <20060227193637.GB20064@mellanox.co.il> <200602281112.41981.jackm@mellanox.co.il> <44049079.1050107@ichips.intel.com> Message-ID: <20060228182327.GD21556@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID > > Jack Morgenstein wrote: > >Which session actually receives the ACK is a toss-up, since when a session > >sends a segment, it is placed (upon send completion) at the tail of the > >wait queue. Thus, arriving ACKs will likely be routed (based upon TID > >only) to one of the duplicate sessions. > > Okay - this makes more sense. Thanks for the clarification. 
Okay - so the proposed fix is to simply prevent duplicate transactions by returning an error code to the user. This is similar to the patch that Jack sent, only we have to add the GID, class, and method resp bit check. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From ralphc at pathscale.com Tue Feb 28 10:22:05 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 28 Feb 2006 10:22:05 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <1141149631.24103.21.camel@camp4.serpentine.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il> <1141149631.24103.21.camel@camp4.serpentine.com> Message-ID: <1141150925.30894.169.camel@brick.internal.keyresearch.com> Can you refresh my memory? What problem are you talking about? This thread has quite a few messages about SDP but I don't remember a discussion about bugs or libsdp. I haven't made any changes to libsdp.conf in my testing of SDP with InfiniPath. I did see some very low bandwidth problems with src zero copies. I worked around the problem by setting the sdp_zcopy_thrsh_src_default to 10000000 (i.e., disabling it). I haven't had time to figure out what causes the zero copy performance problem yet. On Tue, 2006-02-28 at 10:00 -0800, Bryan O'Sullivan wrote: > On Tue, 2006-02-28 at 09:32 +0200, Michael S. Tsirkin wrote: > > > So you either have a modified libsdp.conf (could you post it please?) > > or there's a bug in libsdp. > > I'll have Ralph, who originally ran into these problems, follow up to > see if he can still reproduce the problem.
> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Ralph Campbell From robert.j.woodruff at intel.com Tue Feb 28 10:18:59 2006 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Tue, 28 Feb 2006 10:18:59 -0800 Subject: [openib-general] RFC: SDP plans In-Reply-To: Message-ID: <000201c63c93$71cf2de0$6aa1070a@amr.corp.intel.com> James wrote, >How about posting the RPMs on OpenIB.org? I'd suggest using the OpenIB >Wiki's "Downloads" page: https://openib.org/tiki/tiki-index.php?page=Downloads >I don't see a benefit in placing generated binary objects under >version control. What is the process for posting on this page. Seems like anyone can just post whatever they like here. What is there now is real old. I think that a release 1.0 dowloads area needs to be somewhat more controlled. woody From mshefty at ichips.intel.com Tue Feb 28 10:26:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 10:26:59 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <20060228182038.GC21556@mellanox.co.il> References: <20060225170316.GA15973@mellanox.co.il> <44033E57.8090905@ichips.intel.com> <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> <20060228130819.GJ19855@mellanox.co.il> <44048F73.9040108@ichips.intel.com> <20060228182038.GC21556@mellanox.co.il> Message-ID: <440495F3.9040701@ichips.intel.com> Michael S. Tsirkin wrote: > But the spec says: > > . 
The method used for all packets sent from the Receiver to the Sender > shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending > on which initiated the transfer. > > . The method used for all packets sent from the Sender to the Receiver > shall be SubnAdmGetTableResp(). This is for SA class only. Vendor specific classes do not have this requirement. Also, I don't see that there's any restriction that two systems can't both initiate SubnAdmGetTable requests to each other. (For example, some sort of distributed SA.) - Sean From halr at voltaire.com Tue Feb 28 10:36:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Feb 2006 13:36:27 -0500 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> <1141128535.4335.4148.camel@hal.voltaire.com> <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> Message-ID: <1141151786.4335.6624.camel@hal.voltaire.com> Hi Devesh, On Tue, 2006-02-28 at 11:12, Devesh Sharma wrote: > Hi Hal Thanks for replying. > This setisfies my needs if user can define his own guid to lid > mapping. > whether in this file user can define his own guid to lid mapping? To my knowledge it's not used that way in general but that could work if consistent with the OpenSM LID policy (e.g. LMC, etc.). The format of the file is as follows: 0x0008f10403960985 0x0007 0x0007 0x0008f10400410015 0x0003 0x0003 (e.g GUID, min LID, max LID so the above is for LMC 0 which is the default). -- Hal > Devesh > > On 28 Feb 2006 07:19:35 -0500, Hal Rosenstock > wrote: > Hi Devesh, > > On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: > > Hi list, > > Please anybody brife me about the LID assignment policy used > by opensm > > subnet manager. Can user specify fixed LID mappings using a > file? > > There is a file it creates with these in it so they can be > reused > subsequently. 
It is /var/cache/osm/guid2lid. > > opensm -h has the following option: > -c > --cache-options > Cache the given command line options into the file > /var/cache/osm/opensm.opts for use next invocation > The cache directory can be changed by the > environment > variable OSM_CACHE_DIR > > Is that suitable for your needs ? > > -- Hal From jlentini at netapp.com Tue Feb 28 10:51:28 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 28 Feb 2006 13:51:28 -0500 (EST) Subject: [openib-general] RFC: SDP plans In-Reply-To: <000201c63c93$71cf2de0$6aa1070a@amr.corp.intel.com> References: <000201c63c93$71cf2de0$6aa1070a@amr.corp.intel.com> Message-ID: On Tue, 28 Feb 2006, Bob Woodruff wrote: > James wrote, > >How about posting the RPMs on OpenIB.org? I'd suggest using the OpenIB > >Wiki's "Downloads" page: > > https://openib.org/tiki/tiki-index.php?page=Downloads > > What is the process for posting on this page. Seems like anyone can > just post whatever they like here. That's true. > What is there now is real old. I think that a release 1.0 dowloads > area needs to be somewhat more controlled. I agree. From mst at mellanox.co.il Tue Feb 28 11:02:51 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 28 Feb 2006 21:02:51 +0200 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID In-Reply-To: <440495F3.9040701@ichips.intel.com> References: <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> <20060228130819.GJ19855@mellanox.co.il> <44048F73.9040108@ichips.intel.com> <20060228182038.GC21556@mellanox.co.il> <440495F3.9040701@ichips.intel.com> Message-ID: <20060228190251.GE21556@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] ib_mad: prevent duplicateoutstanding MADtransactions with same TID > > Michael S. Tsirkin wrote: > >But the spec says: > > > > . The method used for all packets sent from the Receiver to the > > Sender > > shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending > > on which initiated the transfer. > > > > . The method used for all packets sent from the Sender to the > > Receiver > > shall be SubnAdmGetTableResp(). > > This is for SA class only. Vendor specific classes do not have this > requirement. OK, fine, but SA is the most interesting case. So let us either 1. Drop ABORT for vendor specific if there is more than one transaction with the same TID/GID/class outstanding 2. Assume vendor specific behaves in the same way as SA class, and ask users to adhere to this rule Further if you are going to work on a spec extension, it could simply add the requirement on the resp bit for vendor specific classes. Right? > Also, I don't see that there's any restriction that two > systems can't both initiate SubnAdmGetTable requests to each other. (For > example, some sort of distributed SA.) Sure, but again, if you initiate a request and then abort it, you clear the response bit, if you are receiving a request and decide to abort it, you set the response bit. 
Therefore if you get an abort you can look at the resp bit: if it is set this is a transaction that you initiated, if it is clear this is a transaction that remote side initiated. I conclude that there's no ambiguity. Am I mistaken? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sean.hefty at intel.com Tue Feb 28 11:06:30 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 11:06:30 -0800 Subject: [openib-general] [PATCH] RFC Verbs: add support for transport specific verbs Message-ID: Add support for transport specific extensions to the ib_device verbs. Relocate process_mad as an IB specific verb. This provides a mechanism to add iWarp specific functionality, such as the iWarp CM calls, to ib_device. Signed-off-by: Sean Hefty --- Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5532) +++ include/rdma/ib_verbs.h (working copy) @@ -824,6 +824,16 @@ struct ib_cache { struct ib_gid_cache **gid_cache; }; +struct ib_verbs { + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +}; + struct ib_device { struct device *dma_device; @@ -954,13 +964,10 @@ struct ib_device { int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); - int (*process_mad)(struct ib_device *device, - int process_mad_flags, - u8 port_num, - struct ib_wc *in_wc, - struct ib_grh *in_grh, - struct ib_mad *in_mad, - struct ib_mad *out_mad); + + union { + struct ib_verbs ib; + } ext_verbs; struct module *owner; struct class_device class_dev; Index: core/mad.c =================================================================== --- core/mad.c (revision 5532) +++ core/mad.c (working copy) @@ -704,9 +704,9 @@ static int handle_outgoing_dr_smp(struct send_wr->wr.ud.port_num, &mad_wc); /* No GRH for DR SMP */ - ret = device->process_mad(device, 0, 
port_num, &mad_wc, NULL, - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); + ret = device->ext_verbs.ib.process_mad(device, 0, port_num, &mad_wc, + NULL, (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); switch (ret) { case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: @@ -1787,7 +1787,7 @@ static void ib_mad_recv_done_handler(str local: /* Give driver "right of first refusal" on incoming MAD */ - if (port_priv->device->process_mad) { + if (port_priv->device->ext_verbs.ib.process_mad) { int ret; if (!response) { @@ -1799,11 +1799,11 @@ local: goto out; } - ret = port_priv->device->process_mad(port_priv->device, 0, - port_priv->port_num, - wc, &recv->grh, - &recv->mad.mad, - &response->mad.mad); + ret = port_priv->device-> + ext_verbs.ib.process_mad(port_priv->device, 0, + port_priv->port_num, wc, + &recv->grh, &recv->mad.mad, + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_CONSUMED) goto out; Index: core/sysfs.c =================================================================== --- core/sysfs.c (revision 5532) +++ core/sysfs.c (working copy) @@ -311,7 +311,7 @@ static ssize_t show_pma_counter(struct i struct ib_mad *out_mad = NULL; ssize_t ret; - if (!p->ibdev->process_mad) + if (!p->ibdev->ext_verbs.ib.process_mad) return sprintf(buf, "N/A (no PMA)\n"); in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); @@ -329,7 +329,7 @@ static ssize_t show_pma_counter(struct i in_mad->data[41] = p->port_num; /* PortSelect field */ - if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, + if ((p->ibdev->ext_verbs.ib.process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, NULL, NULL, in_mad, out_mad) & (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { Index: core/smi.h =================================================================== --- core/smi.h (revision 5532) +++ core/smi.h (working copy) @@ -58,7 +58,7 @@ static inline int smi_check_local_smp(st { /* C14-9:3 -- We're at the 
end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ - return ((device->process_mad && + return ((device->ext_verbs.ib.process_mad && !ib_get_smp_direction(smp) && (smp->hop_ptr == smp->hop_cnt + 1))); } Index: hw/ipath/ipath_verbs.c =================================================================== --- hw/ipath/ipath_verbs.c (revision 5532) +++ hw/ipath/ipath_verbs.c (working copy) @@ -5949,7 +5949,7 @@ static int ipath_register_ib_device(cons dev->dealloc_fmr = ipath_dealloc_fmr; dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; - dev->process_mad = ipath_process_mad; + dev->ext_verbs.ib.process_mad = ipath_process_mad; ret = ib_register_device(dev); if (ret) Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 5532) +++ hw/mthca/mthca_provider.c (working copy) @@ -1329,7 +1329,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; - dev->ib_dev.process_mad = mthca_process_mad; + dev->ib_dev.ext_verbs.ib.process_mad = mthca_process_mad; if (mthca_is_memfree(dev)) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; From mst at mellanox.co.il Tue Feb 28 11:13:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 28 Feb 2006 21:13:02 +0200 Subject: [openib-general] Re: Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws In-Reply-To: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com> References: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com> Message-ID: <20060228191302.GF21556@mellanox.co.il> Quoting Bob Woodruff : > I think that it is pretty important that there be a way for customers > to know exactly what version of openib code they are getting, so > having having a versioning (release) mechanism from openib is > definately needed and is what has started with the release 1.0 branch > that was started last week. I agree with this. What a release should provide is some kind of stabilization process for userspace IB components, similar to what already exists at kernel level. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 11:13:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 21:13:52 +0200 Subject: [openib-general] Re: cmatose questions In-Reply-To: <20060228181300.GB21556@mellanox.co.il> References: <20060228165150.GY19855@mellanox.co.il> <44048C56.3000708@ichips.intel.com> <20060228181300.GB21556@mellanox.co.il> Message-ID: <20060228191351.GG21556@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: cmatose questions > > Quoting r. Sean Hefty : > > Subject: Re: [openib-general] cmatose questions > > > > Michael S. Tsirkin wrote: > > >I then unloaded the client and loaded it again: > > >modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose > > > > The kernel dst_ip isn't this smart. You'll need to provide the IP address > > as: > > > > x9b08040b > > > > Note that there's a userspace version of cmatose in librdmacm that's easier > > to interface to, and can handle normal IP addresses and host names as input. > > Right, but I'm really mostly interested in userspace version. I mean: in the kernel-level version.
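For reference, the value x9b08040b quoted above matches the four octets of 11.4.8.155 laid out in network byte order and read back as a 32-bit integer on a little-endian host. Below is a minimal sketch of that arithmetic only. The helper names dst_ip_param and format_dst_ip are hypothetical and are not part of rdma_cmatose, and the exact spelling the module parameter accepts is assumed from the quoted reply.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of rdma_cmatose): pack the dotted-quad
 * address a.b.c.d so that the first octet lands in the low byte. This is
 * the value a little-endian host sees when it reads the four
 * network-order bytes as one 32-bit integer.
 */
uint32_t dst_ip_param(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (uint32_t)a | ((uint32_t)b << 8) |
	       ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/*
 * Format the packed value in the "x" + 8 hex digits spelling used in the
 * quoted reply. Returns what snprintf() returns.
 */
int format_dst_ip(char *buf, size_t len,
		  uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return snprintf(buf, len, "x%08x",
			(unsigned int)dst_ip_param(a, b, c, d));
}
```

With the address from the quoted modprobe command, format_dst_ip(buf, sizeof buf, 11, 4, 8, 155) writes x9b08040b: 11 is 0x0b, 4 is 0x04, 8 is 0x08 and 155 is 0x9b, with the octets appearing in reverse order in the printed integer.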
-- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mshefty at ichips.intel.com Tue Feb 28 11:14:25 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 11:14:25 -0800 Subject: [openib-general] Re: cmatose questions In-Reply-To: <20060228191351.GG21556@mellanox.co.il> References: <20060228165150.GY19855@mellanox.co.il> <44048C56.3000708@ichips.intel.com> <20060228181300.GB21556@mellanox.co.il> <20060228191351.GG21556@mellanox.co.il> Message-ID: <4404A111.8000408@ichips.intel.com> Michael S. Tsirkin wrote: >>>>I then unloaded the client and loaded it again: >>>>modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose >>> >>>The kernel dst_ip isn't this smart. You'll need to provide the IP address >>>as: >>> >>>x9b08040b >>> >>>Note that there's a userspace version of cmatose in librdmacm that's easier >>>to interface to, and can handle normal IP addresses and host names as input. >> >>Right, but I'm really mostly interested in userspace version. > > > I mean: in kernel level version. You just need to specify the IP address in network-byte order as a single integer. The example above is for the IP address that you gave: 11.4.8.155. - Sean From mst at mellanox.co.il Tue Feb 28 11:20:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 21:20:57 +0200 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: References: Message-ID: <20060228192056.GH21556@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] RFC Verbs: add support for transport specific verbs > > Add support for transport specific extensions to the ib_device verbs. > Relocate process_mad as an IB specific verb. > > This provides a mechanism to add iWarp specific functionality, such as > the iWarp CM calls, to ib_device. > > Signed-off-by: Sean Hefty Is it worth it to separate these things out? 
Even within IB lots of methods are optional - so why can't an iWarp device just avoid defining process_local_mad, and an IB device avoid defining iWarp CM ops? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From bos at pathscale.com Tue Feb 28 11:22:39 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 28 Feb 2006 11:22:39 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <1141150925.30894.169.camel@brick.internal.keyresearch.com> References: <20060227173459.GB19855@mellanox.co.il> <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il> <1141149631.24103.21.camel@camp4.serpentine.com> <1141150925.30894.169.camel@brick.internal.keyresearch.com> Message-ID: <1141154559.20227.13.camel@serpentine.pathscale.com> On Tue, 2006-02-28 at 10:22 -0800, Ralph Campbell wrote: > Can you refresh my memory? What problem are you talking about? > This thread has quite a few messages about SDP but I don't > remember a discussion about bugs or libsdp. At least someone within PathScale ended up measuring IPoIB performance when they thought they were measuring SDP performance, even though they used libsdp, and they didn't notice until the numbers got some scrutiny. This is what I want to avoid having end users run into. Jeremy, was it you? Do you remember what was wrong? References: <20060228155605.GU19855@mellanox.co.il> Message-ID: <20060228192634.GI21556@mellanox.co.il> Quoting r. Roland Dreier : > And in fact it might be interesting for someone to figure out how > to use JHBuild (http://www.gnome.org/~jamesh/jhbuild.html) or the like > to make bleeding-edge installs easier for power users. OT: I wonder why everyone invents a build tool when we have GNU make?
A pet idea of mine is to try and use the kbuild system for some userspace package. The menuconfig stuff is really amazingly good. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mee at pathscale.com Tue Feb 28 11:39:33 2006 From: mee at pathscale.com (Jeremy Brown) Date: Tue, 28 Feb 2006 11:39:33 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <1141154559.20227.13.camel@serpentine.pathscale.com> References: <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il> <1141149631.24103.21.camel@camp4.serpentine.com> <1141150925.30894.169.camel@brick.internal.keyresearch.com> <1141154559.20227.13.camel@serpentine.pathscale.com> Message-ID: <20060228193933.GA32713@pathscale.com> On Tue, Feb 28, 2006 at 11:22:39AM -0800, Bryan O'Sullivan wrote: > On Tue, 2006-02-28 at 10:22 -0800, Ralph Campbell wrote: > > Can you refresh my memory? What problem are you talking about? > > This thread has quite a few messages about SDP but I don't > > remember a discussion about bugs or libsdp. > > At least someone within PathScale ended up measuring IPoIB performance > when they thought they were measuring SDP performance, even though they > used libsdp, and they didn't notice until the numbers got some scrutiny. > This is what I want to avoid having end users run into. > > Jeremy, was it you? Do you remember what was wrong? Yes... as you know, you need to give LD_PRELOAD=/usr/lib64/libsdp.so SIMPLE_LIBSDP=1 on the command line to get SDP. My previous error was to leave out "SIMPLE_LIBSDP=1". When I left that out, it silently failed over to IPoIB. 
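Since the failure described above was silent, one way to avoid forgetting the second variable is to set both from a small launcher. The following is a minimal sketch, assuming only the two variables named in this message are needed. The function names prepare_sdp_env and run_with_sdp are hypothetical, and the sketch does not verify that SDP is actually in use afterwards.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the two environment variables named in the
 * message above, so that a program exec'ed afterwards starts with the
 * given libsdp preloaded. Returns 0 on success, -1 if setenv() fails.
 */
int prepare_sdp_env(const char *libsdp)
{
	if (setenv("LD_PRELOAD", libsdp, 1) != 0)
		return -1;
	if (setenv("SIMPLE_LIBSDP", "1", 1) != 0)
		return -1;
	return 0;
}

/*
 * Hypothetical launcher: replace the current process with argv[0],
 * started with the environment prepared above. Returns only on failure.
 */
int run_with_sdp(const char *libsdp, char *const argv[])
{
	if (prepare_sdp_env(libsdp) != 0)
		return -1;
	execvp(argv[0], argv);
	return -1;	/* execvp() returns only if the exec failed */
}
```

A caller would pass the library name it normally gives to LD_PRELOAD (for example the /usr/lib64/libsdp.so path above) together with the benchmark's argument vector.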
Jeremy From halr at voltaire.com Tue Feb 28 11:35:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Feb 2006 14:35:34 -0500 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> Message-ID: <1141155334.4335.6968.camel@hal.voltaire.com> On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > The Windows IPoIB follows a sequence like: > > if( GET broadcast group == NO_ERROR ) > if( SET join broadcast group != NO_ERROR ) repeat GET; > else > if( SET create broadcast group != NO_ERROR ) repeat GET; Another approach that could be used is to subscribe for MC group creation and deletion traps rather than a timed poll. BTW, is there some timer before the loop is repeated ? -- Hal From bos at pathscale.com Tue Feb 28 11:46:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 28 Feb 2006 11:46:20 -0800 Subject: [openib-general] Re: RFC: SDP plans In-Reply-To: <20060228193933.GA32713@pathscale.com> References: <19a929370602270944v30f79516id428b0913bd04a08@mail.gmail.com> <44034423.8090109@ichips.intel.com> <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il> <1141149631.24103.21.camel@camp4.serpentine.com> <1141150925.30894.169.camel@brick.internal.keyresearch.com> <1141154559.20227.13.camel@serpentine.pathscale.com> <20060228193933.GA32713@pathscale.com> Message-ID: <1141155980.20227.28.camel@serpentine.pathscale.com> On Tue, 2006-02-28 at 11:39 -0800, Jeremy Brown wrote: > you need to give > > LD_PRELOAD=/usr/lib64/libsdp.so SIMPLE_LIBSDP=1 > > on the command line to get SDP. 
My previous error was to leave out > "SIMPLE_LIBSDP=1". When I left that out, it silently failed over to > IPoIB. OK, thanks for jogging my memory. Michael, is this still the case? References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> Message-ID: <1141155650.4335.7000.camel@hal.voltaire.com> Hi again Fab, On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > Specifically, the problem relates to handling a 1X node trying to join > the broadcast group, and what the retry policy should be. If the > group already exists at 4X, the join should fail if the SM follows the > compliance statements in the IB spec. Correct; it should fail with ERR_REQ_INVALID > Because the code allowed for > the broadcast group not pre-existing (that is, a join could fail > because the group wasn't created), it was unclear whether a failure of > the join indicated that there was a setting incompatibility (1X vs. > 4X), or just whether the group needed to be created. A join to a nonexistent group should return ERR_REQ_INSUFFICIENT_COMPONENTS. Is this sufficient or is more needed ? -- Hal > Then, because > the code handled the race where some other node beat it to creation > and thus resulted in invalid settings, a failure in creation resulted > in a retry of the whole process, staring with a new GET query. From mshefty at ichips.intel.com Tue Feb 28 11:50:24 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 11:50:24 -0800 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <20060228192056.GH21556@mellanox.co.il> References: <20060228192056.GH21556@mellanox.co.il> Message-ID: <4404A980.2040609@ichips.intel.com> Michael S. Tsirkin wrote: > Is it worth it to separate these things out? 
> Even within IB lots of methods are optional - so why cant an iWarp device just > avoid defining process_local_mad, and IB device avoid defining iWarp CM ops? There are 7 additional function needed by iWarp. How should these be added to ib_device? Using process_mad as an example, we would add all 7 function prototypes directly to ib_device. Tom's original proposal was to add an iWarp specific pointer to ib_device, with the functions declared as part of a structure referenced by that pointer. I'd just like consistency on how transport specific functionality is handled, more than I have a specific preference at this point. - Sean From mst at mellanox.co.il Tue Feb 28 11:54:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 21:54:39 +0200 Subject: [openib-general] Re: Re: RFC: SDP plans In-Reply-To: <1141155980.20227.28.camel@serpentine.pathscale.com> References: <20060227184212.GD19265@mellanox.co.il> <440349FD.1050103@ichips.intel.com> <20060227193311.GA20064@mellanox.co.il> <1141077981.30345.3.camel@serpentine.pathscale.com> <20060228073209.GC17543@mellanox.co.il> <1141149631.24103.21.camel@camp4.serpentine.com> <1141150925.30894.169.camel@brick.internal.keyresearch.com> <1141154559.20227.13.camel@serpentine.pathscale.com> <20060228193933.GA32713@pathscale.com> <1141155980.20227.28.camel@serpentine.pathscale.com> Message-ID: <20060228195439.GK21556@mellanox.co.il> Quoting r. Bryan O'Sullivan : > Subject: Re: Re: RFC: SDP plans > > On Tue, 2006-02-28 at 11:39 -0800, Jeremy Brown wrote: > > > you need to give > > > > LD_PRELOAD=/usr/lib64/libsdp.so SIMPLE_LIBSDP=1 Its typically a better idea not to put the path in LD_PRELOAD: LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=1 this way thigns will work for 32 and 64 bit apps > > on the command line to get SDP. My previous error was to leave out > > "SIMPLE_LIBSDP=1". When I left that out, it silently failed over to > > IPoIB. > > OK, thanks for jogging my memory. Michael, is this still the case? 
Hmm, this could happen if the libsdp.conf isn't found or isn't readable. It might be a good idea to make libsdp fail in this case. I'll put this on my list. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From mst at mellanox.co.il Tue Feb 28 11:55:58 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 21:55:58 +0200 Subject: [openib-general] Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <4404A980.2040609@ichips.intel.com> References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> Message-ID: <20060228195558.GL21556@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs > > Michael S. Tsirkin wrote: > >Is it worth it to separate these things out? > >Even within IB lots of methods are optional - so why cant an iWarp device > >just > >avoid defining process_local_mad, and IB device avoid defining iWarp CM > >ops? > > There are 7 additional function needed by iWarp. How should these be added > to ib_device? Using process_mad as an example, we would add all 7 function > prototypes directly to ib_device. Right. That's what I had in mind. So they are NULL for IB devices and that's that. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From sean.hefty at intel.com Tue Feb 28 12:11:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 12:11:01 -0800 Subject: [openib-general] [PATCH] uMAD: remove receive side data copy Message-ID: Remove the additional data copy when handing a received MAD up to userspace. Only a single data copy to the user's buffer is now performed. Checks for the correct userspace buffer size are also tightened. Signed-off-by: Sean Hefty --- NOTE: I tested this on a node running opensm, but I have no way of testing that received RMPP MADs are copied correctly.
Index: user_mad.c =================================================================== --- user_mad.c (revision 5532) +++ user_mad.c (working copy) @@ -121,9 +121,9 @@ struct ib_umad_file { struct ib_umad_packet { struct ib_mad_send_buf *msg; + struct ib_mad_recv_wc *recv_wc; struct list_head list; int length; - struct list_head seg_list; struct ib_user_mad mad; }; @@ -188,62 +188,6 @@ static int data_offset(u8 mgmt_class) return IB_MGMT_RMPP_HDR; } -static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, - struct ib_umad_packet *packet) -{ - struct ib_mad_recv_buf *seg_buf; - struct ib_rmpp_mad *rmpp_mad; - void *data; - struct ib_rmpp_segment *seg; - int size, len, offset; - u8 flags; - - len = mad_recv_wc->mad_len; - if (len <= sizeof(struct ib_mad)) { - memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); - return 0; - } - - /* Multipacket (RMPP) MAD */ - offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); - - list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { - rmpp_mad = (struct ib_rmpp_mad *) seg_buf->mad; - flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); - - if (flags & IB_MGMT_RMPP_FLAG_FIRST) { - size = sizeof(*rmpp_mad); - memcpy(&packet->mad.data, rmpp_mad, size); - } else { - data = (void *) rmpp_mad + offset; - if (flags & IB_MGMT_RMPP_FLAG_LAST) - size = len; - else - size = sizeof(*rmpp_mad) - offset; - seg = kmalloc(sizeof(struct ib_rmpp_segment) + - sizeof(struct ib_rmpp_mad) - offset, - GFP_KERNEL); - if (!seg) - return -ENOMEM; - memcpy(seg->data, data, size); - list_add_tail(&seg->list, &packet->seg_list); - } - len -= size; - } - return 0; -} - -static void free_packet(struct ib_umad_packet *packet) -{ - struct ib_rmpp_segment *seg, *tmp; - - list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { - list_del(&seg->list); - kfree(seg); - } - kfree(packet); -} - static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { @@ -267,25 +211,20 @@ static void recv_handler(struct 
ib_mad_a { struct ib_umad_file *file = agent->context; struct ib_umad_packet *packet; - int length; if (mad_recv_wc->wc->status != IB_WC_SUCCESS) - goto out; + goto err1; - length = mad_recv_wc->mad_len; - packet = kzalloc(sizeof *packet + sizeof(struct ib_mad), GFP_KERNEL); + packet = kzalloc(sizeof *packet, GFP_KERNEL); if (!packet) - goto out; - INIT_LIST_HEAD(&packet->seg_list); - packet->length = length; + goto err1; - if (copy_recv_mad(mad_recv_wc, packet)) { - free_packet(packet); - goto out; - } + packet->length = mad_recv_wc->mad_len; + packet->recv_wc = mad_recv_wc; packet->mad.hdr.status = 0; - packet->mad.hdr.length = length + sizeof (struct ib_user_mad); + packet->mad.hdr.length = sizeof (struct ib_user_mad) + + mad_recv_wc->mad_len; packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid); packet->mad.hdr.sl = mad_recv_wc->wc->sl; @@ -301,21 +240,87 @@ static void recv_handler(struct ib_mad_a } if (queue_packet(file, agent, packet)) - free_packet(packet); + goto err2; + return; -out: +err2: + kfree(packet); +err1: ib_free_recv_mad(mad_recv_wc); } +static ssize_t copy_recv_mad(char __user *buf, struct ib_umad_packet *packet, + size_t count) +{ + struct ib_mad_recv_buf *recv_buf; + int left, seg_payload, offset, max_seg_payload; + + /* We need enough room to copy the first (or only) MAD segment. */ + recv_buf = &packet->recv_wc->recv_buf; + if ((packet->length <= sizeof (*recv_buf->mad) && + count < sizeof (packet->mad) + packet->length) || + (packet->length > sizeof (*recv_buf->mad) && + count < sizeof (packet->mad) + sizeof (*recv_buf->mad))) + return -EINVAL; + + if (copy_to_user(buf, &packet->mad, sizeof (packet->mad))) + return -EFAULT; + + buf += sizeof (packet->mad); + seg_payload = min_t(int, packet->length, sizeof (*recv_buf->mad)); + if (copy_to_user(buf, recv_buf->mad, seg_payload)) + return -EFAULT; + + if (seg_payload < packet->length) { + /* + * Multipacket RMPP MAD message. 
Copy remainder of message. + * Note that last segment may have a shorter payload. + */ + if (count < sizeof (packet->mad) + packet->length) { + /* + * The buffer is too small, return the first RMPP segment, + * which includes the RMPP message length. + */ + return -ENOSPC; + } + offset = data_offset(recv_buf->mad->mad_hdr.mgmt_class); + max_seg_payload = sizeof (struct ib_mad) - offset; + + for (left = packet->length - seg_payload, buf += seg_payload; + left; left -= seg_payload, buf += seg_payload) { + recv_buf = container_of(recv_buf->list.next, + struct ib_mad_recv_buf, list); + seg_payload = min(left, max_seg_payload); + if (copy_to_user(buf, ((void *) recv_buf->mad) + offset, + seg_payload)) + return -EFAULT; + } + } + return sizeof (packet->mad) + packet->length; +} + +static ssize_t copy_send_mad(char __user *buf, struct ib_umad_packet *packet, + size_t count) +{ + ssize_t size = sizeof (packet->mad) + packet->length; + + if (count < size) + return -EINVAL; + + if (copy_to_user(buf, &packet->mad, size)) + return -EFAULT; + + return size; +} + static ssize_t ib_umad_read(struct file *filp, char __user *buf, size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; - struct ib_rmpp_segment *seg; struct ib_umad_packet *packet; - ssize_t ret, size; + ssize_t ret; - if (count < sizeof (struct ib_user_mad) + sizeof (struct ib_mad)) + if (count < sizeof (struct ib_user_mad)) return -EINVAL; spin_lock_irq(&file->recv_lock); @@ -338,52 +343,21 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); - size = min_t(int, sizeof (struct ib_mad), packet->length); - if (copy_to_user(buf, &packet->mad, - sizeof(struct ib_user_mad) + size)) { - ret = -EFAULT; - goto err; - } + if (packet->recv_wc) + ret = copy_recv_mad(buf, packet, count); + else + ret = copy_send_mad(buf, packet, count); - if (count < packet->length + sizeof (struct ib_user_mad)) - /* - * User buffer too small. 
Return first RMPP segment (which - * includes RMPP message length). - */ - ret = -ENOSPC; - else if (packet->length <= sizeof(struct ib_mad)) - ret = packet->length + sizeof(struct ib_user_mad); - else { - int len = packet->length - sizeof(struct ib_mad); - struct ib_rmpp_mad *rmpp_mad = - (struct ib_rmpp_mad *) packet->mad.data; - int max_seg_payload = sizeof(struct ib_mad) - - data_offset(rmpp_mad->mad_hdr.mgmt_class); - int seg_payload; - /* - * Multipacket RMPP MAD message. Copy remainder of message. - * Note that last segment may have a shorter payload. - */ - buf += sizeof(struct ib_user_mad) + sizeof(struct ib_mad); - list_for_each_entry(seg, &packet->seg_list, list) { - seg_payload = min_t(int, len, max_seg_payload); - if (copy_to_user(buf, seg->data, seg_payload)) { - ret = -EFAULT; - goto err; - } - buf += seg_payload; - len -= seg_payload; - } - ret = packet->length + sizeof (struct ib_user_mad); - } -err: if (ret < 0) { /* Requeue packet */ spin_lock_irq(&file->recv_lock); list_add(&packet->list, &file->recv_list); spin_unlock_irq(&file->recv_lock); - } else - free_packet(packet); + } else { + if (packet->recv_wc) + ib_free_recv_mad(packet->recv_wc); + kfree(packet); + } return ret; } From rdreier at cisco.com Tue Feb 28 12:29:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Feb 2006 12:29:30 -0800 Subject: [Openib-promoters] RE: [openib-general] Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws In-Reply-To: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com> (Bob Woodruff's message of "Tue, 28 Feb 2006 09:48:38 -0800") References: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com> Message-ID: Bob> Also, lets not confuse release versioning with Bob> distribution. Projects like gcc do release versions, but end Bob> users typically get the code from the Linux Bob> distributors. 
OpenIB doing release versions will also make it
    Bob> easier for distros to pick up the code for distribution and
    Bob> be able to tell people what version it is, just like I can
    Bob> tell what version of gcc is on a particular RedHat or SUSE
    Bob> CD.

Yes, although SuSE gcc 4.0.2 may not exactly equal Red Hat gcc 4.0.2.
And this is important: distributors should be able to fix bugs and
even add features that they think their customers want.

 - R.

From tom at opengridcomputing.com Tue Feb 28 12:48:40 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 28 Feb 2006 14:48:40 -0600
Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs
In-Reply-To: <4404A980.2040609@ichips.intel.com>
References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com>
Message-ID: <1141159720.5319.41.camel@trinity.ogc.int>

On Tue, 2006-02-28 at 11:50 -0800, Sean Hefty wrote:
> Michael S. Tsirkin wrote:
> > Is it worth it to separate these things out?
> > Even within IB lots of methods are optional - so why can't an iWarp device just
> > avoid defining process_local_mad, and IB device avoid defining iWarp CM ops?
>
> There are 7 additional functions needed by iWarp. How should these be added to
> ib_device? Using process_mad as an example, we would add all 7 function
> prototypes directly to ib_device.

... And in fact in the end there will be more. This separation allows
one transport to change without impacting the other.

>
> Tom's original proposal was to add an iWarp specific pointer to ib_device, with
> the functions declared as part of a structure referenced by that pointer.
>
> I'd just like consistency on how transport specific functionality is handled,
> more than I have a specific preference at this point.

I like Sean's union approach better too. I think my original approach
was an aesthetically unpleasant workaround (ugly hack).
> > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ftillier at silverstorm.com Tue Feb 28 12:45:43 2006 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 28 Feb 2006 12:45:43 -0800 Subject: [openib-general] IPoIB broadcast MC group membership In-Reply-To: <1141155650.4335.7000.camel@hal.voltaire.com> References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1141155650.4335.7000.camel@hal.voltaire.com> Message-ID: <79ae2f320602281245g6cd48d12qe32e7e92282379a9@mail.gmail.com> Hi Hal, On 28 Feb 2006 14:41:58 -0500, Hal Rosenstock wrote: > Hi again Fab, > > On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote: > > Specifically, the problem relates to handling a 1X node trying to join > > the broadcast group, and what the retry policy should be. If the > > group already exists at 4X, the join should fail if the SM follows the > > compliance statements in the IB spec. > > Correct; it should fail with ERR_REQ_INVALID > > > Because the code allowed for > > the broadcast group not pre-existing (that is, a join could fail > > because the group wasn't created), it was unclear whether a failure of > > the join indicated that there was a setting incompatibility (1X vs. > > 4X), or just whether the group needed to be created. > > A join to a nonexistent group should return > ERR_REQ_INSUFFICIENT_COMPONENTS. > > Is this sufficient or is more needed ? Hmm, this should allow the loop to end - an ERR_REQ_INVALID response to the join would break out of the loop. ERR_REQ_INVALID to the create goes to join... I had missed this status code, thanks! I think it will work, and will let you know if it doesn't. 
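The retry policy discussed above can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum names, their numeric values, and `next_step` are hypothetical and are not taken from the Windows IPoIB code or from any SA header. Retrying the join after any other join failure, and falling back to a join after any failed create, are also assumptions; the thread only confirms the ERR_REQ_INVALID and ERR_REQ_INSUFFICIENT_COMPONENTS cases.

```c
#include <assert.h>

/* Hypothetical status names; values are illustrative, not the SA encodings. */
enum sa_status {
	SA_OK,
	SA_ERR_REQ_INVALID,
	SA_ERR_REQ_INSUFFICIENT_COMPONENTS,
	SA_ERR_OTHER
};

/* Hypothetical steps of the broadcast group join loop. */
enum mc_step { MC_JOIN, MC_CREATE, MC_DONE, MC_GIVE_UP };

/*
 * Given the request that was just tried (join or create) and the status
 * returned for it, pick the next step of the loop.
 */
static enum mc_step next_step(enum mc_step tried, enum sa_status status)
{
	assert(tried == MC_JOIN || tried == MC_CREATE);

	if (status == SA_OK)
		return MC_DONE;

	if (tried == MC_JOIN) {
		/* Group exists but the settings are incompatible (1X vs. 4X). */
		if (status == SA_ERR_REQ_INVALID)
			return MC_GIVE_UP;
		/* Group does not exist yet, so try to create it. */
		if (status == SA_ERR_REQ_INSUFFICIENT_COMPONENTS)
			return MC_CREATE;
		/* Assumption: any other join failure is simply retried. */
		return MC_JOIN;
	}

	/* A failed create (e.g. ERR_REQ_INVALID) goes back to the join. */
	return MC_JOIN;
}
```

With this shape the loop terminates for a 1X node facing a 4X group, because the rejected join no longer leads back to a create.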
Thanks, - Fab From mshefty at ichips.intel.com Tue Feb 28 12:48:56 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 12:48:56 -0800 Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs In-Reply-To: <200602281924.33041.jackm@mellanox.co.il> References: <200602281924.33041.jackm@mellanox.co.il> Message-ID: <4404B738.4030202@ichips.intel.com> Jack Morgenstein wrote: > +static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, > + struct ib_mad_recv_wc *rwc ) > +{ > + struct ib_ah_attr attr; > + u8 send_resp, rcv_resp; > + > + send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> > + mad_hdr.method & IB_MGMT_METHOD_RESP; > + rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; > + > + if (!send_resp && rcv_resp) > + /* is request/response. GID/LIDs are both local (same). */ > + return 1; > + > + if (send_resp == rcv_resp) > + /* both requests, or both responses. GIDs different */ > + return 0; The two checks above are only checking response bits of a sent and received MAD. How do these relate to checking the GIDs? > + if (ib_query_ah(wr->send_buf.ah, &attr)) > + /* Assume not equal, to avoid false positives. */ > + return 0; Querying for the address handle every time seems expensive. Can't we save the necessary information inside the ah? Failing that, we should move the query when sending a MAD, so it can be done once. - Sean From mst at mellanox.co.il Tue Feb 28 13:03:16 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 28 Feb 2006 23:03:16 +0200
Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs
In-Reply-To: <1141159720.5319.41.camel@trinity.ogc.int>
References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int>
Message-ID: <20060228210316.GE21824@mellanox.co.il>

Quoting Tom Tucker :
> > Using process_mad as an example, we would add all 7 function
> > prototypes directly to ib_device.
>
> ... And in fact in the end there will be more.

Oh, I hope not much more.

> This separation allows
> one transport to change without impacting the other.

What kind of impact does adding some new field have?

My point is, I have to test whether the function is implemented anyway,
so why add two checks: one for device type, another for function
implementation? It's complicated and inefficient.

The gain in memory usage by using a union is negligible, since it's per
device and we don't have that many devices.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From weiny2 at llnl.gov Tue Feb 28 13:03:34 2006
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 28 Feb 2006 13:03:34 -0800
Subject: [openib-general] Suggested components to support in 1.0
In-Reply-To: <1140803238.1158.42.camel@camp4.serpentine.com>
References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com>
Message-ID: <20060228130334.406acc52.weiny2@llnl.gov>

On Fri, 24 Feb 2006 09:47:18 -0800 Bryan O'Sullivan wrote:
> On Fri, 2006-02-24 at 12:31 -0500, Hal Rosenstock wrote:
>
> > > Userland software:
> > >
> > > * libibverbs
> > > * libsdp
> > > * opensm
> >
> > There are diags as well as OpenSM in the management directory.
>
> Do you want the diags in, or out? They're not packaged in any way, so
> I'd vote for "out".
>

We (LLNL) use and like them. The more Diags the better. My vote is "in".
Ira

From caitlinb at broadcom.com Tue Feb 28 13:11:27 2006
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 28 Feb 2006 13:11:27 -0800
Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs
Message-ID: <54AD0F12E08D1541B826BE97C98F99F12977EA@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> Quoting r. Sean Hefty :
>> Subject: [PATCH] RFC Verbs: add support for transport specific verbs
>>
>> Add support for transport specific extensions to the ib_device verbs.
>> Relocate process_mad as an IB specific verb.
>>
>> This provides a mechanism to add iWarp specific functionality, such
>> as the iWarp CM calls, to ib_device.
>>
>> Signed-off-by: Sean Hefty
>
> Is it worth it to separate these things out?
> Even within IB lots of methods are optional - so why can't an
> iWarp device just avoid defining process_local_mad, and IB
> device avoid defining iWarp CM ops?

Ultimately there needs to be a simple way of knowing what
is transport specific. But a naming prefix is adequate
for that purpose.

So the question becomes whether an extra layer of referencing
(requiring a little more typing, but no extra run-time cost)
is justified by the savings of not storing irrelevant pointers.

When it is just a pointer that already can be optional I'm
not sure. When it would pay off a bit more is when attributes
are segregated. But in either case it really won't amount to much.

The more critical question is which is *safer*, having NULL
values so that IB-specific code that stumbles on an iWARP QP
(or vice versa) will see NULL values (as opposed to a false
interpretation of the iWARP data) versus the safety of requiring
a more explicit decision to reference transport specific fields.

My thinking had been that the ib_ and iw_ underscores were
good enough on the latter point and that the savings of actual
data space from using a union was not justified. But that's only
a first guess / slight preference.
Is there a style precedent here in other kernel code that we should be following? From tom at opengridcomputing.com Tue Feb 28 13:25:59 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 28 Feb 2006 15:25:59 -0600 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <20060228210316.GE21824@mellanox.co.il> References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int> <20060228210316.GE21824@mellanox.co.il> Message-ID: <1141161959.5319.52.camel@trinity.ogc.int> On Tue, 2006-02-28 at 23:03 +0200, Michael S. Tsirkin wrote: > Quoting Tom Tucker : > > > Using process_mad as an example, we would add all 7 function > > > prototypes directly to ib_device. > > > > ... And in fact in the end there will be more. > > Oh, I hope not much more. llp_connect, llp_accept, llp_reject, llp_listen, llp_send, llp_recv, llp_close, net_event_notif > > > This separation allows > > one transport to change without impacting the other. > > What kind of impact does adding some new field have? It complicates the data structure and makes it somewhat more difficult to maintain. I also like the idea that iWARP support can be added in a separate file without hitting either ib_verbs.h or the core ib_device structure. This lessens the risk that a change in one transport's verbs will regress the other. > > My point is, I have to test whether the function is implemented anyway, > so why add two checks: one for device type, another for function > implementation? Its complicated and inefficient. > I don't see these as the same thing fmr / srq are optional features supported on both transports. MAD processing has no meaning on iWARP and the iWARP CM verbs have no meaning in IB. We're talking about partitioning transport specific verbs, not optional features. > The gain in memory usgae by using union is neglidgible, since its per > device and we dont have that many devices. 
> I think this is true, but I don't see memory optimization as the thrust of the proposal. I think the idea is improving maintainability. From mst at mellanox.co.il Tue Feb 28 13:38:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 23:38:19 +0200 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <1141161959.5319.52.camel@trinity.ogc.int> References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int> <20060228210316.GE21824@mellanox.co.il> <1141161959.5319.52.camel@trinity.ogc.int> Message-ID: <20060228213819.GF21824@mellanox.co.il> Quoting r. Tom Tucker : > Subject: Re: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs > > On Tue, 2006-02-28 at 23:03 +0200, Michael S. Tsirkin wrote: > > Quoting Tom Tucker : > > > > Using process_mad as an example, we would add all 7 function > > > > prototypes directly to ib_device. > > > > > > ... And in fact in the end there will be more. > > > > Oh, I hope not much more. > > llp_connect, llp_accept, llp_reject, llp_listen, llp_send, llp_recv, > llp_close, net_event_notif 8 rather than 7 then, not too bad. > > > > > This separation allows > > > one transport to change without impacting the other. > > > > What kind of impact does adding some new field have? > > It complicates the data structure and makes it somewhat more difficult > to maintain. I also like the idea that iWARP support can be added in a > separate file without hitting either ib_verbs.h or the core ib_device > structure. Okay, but lets try to avoid adding runtime overhead. > This lessens the risk that a change in one transport's verbs > will regress the other. I dont see how - the union size will likely change anyway. > > My point is, I have to test whether the function is implemented anyway, > > so why add two checks: one for device type, another for function > > implementation? 
It's complicated and inefficient.
> >
>
> I don't see these as the same thing. fmr / srq are optional features
> supported on both transports. MAD processing has no meaning on iWARP and
> the iWARP CM verbs have no meaning in IB. We're talking about
> partitioning transport specific verbs, not optional features.

> > The gain in memory usage by using a union is negligible, since it's per
> > device and we don't have that many devices.
> >
> I think this is true, but I don't see memory optimization as the thrust
> of the proposal. I think the idea is improving maintainability.

Then at least let's make it a structure, not a union.
This way a single test is sufficient to figure out whether
a specific function is supported.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From ftillier at silverstorm.com Tue Feb 28 13:36:45 2006
From: ftillier at silverstorm.com (Fabian Tillier)
Date: Tue, 28 Feb 2006 13:36:45 -0800
Subject: [openib-general] IPoIB broadcast MC group membership
In-Reply-To: <1141155334.4335.6968.camel@hal.voltaire.com>
References: <79ae2f320602201814s3df76825w50fb759238e91077@mail.gmail.com> <79ae2f320602201853n2982624an1c845dac9776ccef@mail.gmail.com> <79ae2f320602202210s7a668b52if968e1fad058019c@mail.gmail.com> <1141155334.4335.6968.camel@hal.voltaire.com>
Message-ID: <79ae2f320602281336v441fda46pf06a09a4efa9f747@mail.gmail.com>

On 28 Feb 2006 14:35:34 -0500, Hal Rosenstock wrote:
> On Tue, 2006-02-21 at 01:10, Fabian Tillier wrote:
> > The Windows IPoIB follows a sequence like:
> >
> > if( GET broadcast group == NO_ERROR )
> >     if( SET join broadcast group != NO_ERROR ) repeat GET;
> > else
> >     if( SET create broadcast group != NO_ERROR ) repeat GET;
>
> Another approach that could be used is to subscribe for MC group
> creation and deletion traps rather than a timed poll.

That wouldn't really help, as you could still have a race between
getting the creation notification and the join succeeding wherein the
group could be deleted.
> BTW, is there some timer before the loop is repeated ? No, and you really don't want one. If the GET succeeded, then you want to join ASAP. Note that I got the above loop wrong - if the join fails, the code tries to create the group, not repeat the GET. For that step in the loop, you again don't want to wait, as you want the group created ASAP. The problem was one of never exiting the loop, which your suggestion of using the ERR_REQ_INSUFFICIENT_COMPONENTS should help with. - Fab From mst at mellanox.co.il Tue Feb 28 13:47:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Feb 2006 23:47:21 +0200 Subject: [openib-general] Re: [Openib-promoters] RE: Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws In-Reply-To: References: <000101c63c8f$34bb5540$6aa1070a@amr.corp.intel.com> Message-ID: <20060228214721.GG21824@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [Openib-promoters] RE: Need for ONE OpenIB Release process that allmembers can agree to and that follows OpenIB Bylaws > > Bob> Also, lets not confuse release versioning with > Bob> distribution. Projects like gcc do release versions, but end > Bob> users typically get the code from the Linux > Bob> distributors. OpenIB doing release versions will also make it > Bob> easier for distros to pick up the code for distribution and > Bob> be able to tell people what version it is, just like I can > Bob> tell what version of gcc is on a particular RedHat or SUSE > Bob> CD. > > Yes, although SuSE gcc 4.0.2 may not exactly equal Red Hat gcc 4.0.2. > And this is important: distributors should be able to fix bugs and > even add features that they think their customers want. Right, and note how each distribution has its own versioning scheme. -- Michael S. 
Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Tue Feb 28 13:53:55 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 28 Feb 2006 13:53:55 -0800 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297805@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Tue, 2006-02-28 at 23:03 +0200, Michael S. Tsirkin wrote: >> Quoting Tom Tucker : >>>> Using process_mad as an example, we would add all 7 function >>>> prototypes directly to ib_device. >>> >>> ... And in fact in the end there will be more. >> >> Oh, I hope not much more. > > llp_connect, llp_accept, llp_reject, llp_listen, llp_send, > llp_recv, llp_close, net_event_notif > >> >>> This separation allows >>> one transport to change without impacting the other. >> >> What kind of impact does adding some new field have? > > It complicates the data structure and makes it somewhat more > difficult to maintain. I also like the idea that iWARP > support can be added in a separate file without hitting > either ib_verbs.h or the core ib_device structure. This > lessens the risk that a change in one transport's verbs will regress > the other. > Being able to define transport specific fields in a transport specific header file would be a major benefit. I think it trumps all the other factors cited. The only trick is avoiding an extra layer of indirection for any fastpath operations. So we may want to keep all fastpath operations transport neutral, at least on a syntax basis, to avoid the extra dereference. I'm not sure there are any, though. From mst at mellanox.co.il Tue Feb 28 14:02:26 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Wed, 1 Mar 2006 00:02:26 +0200
Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs
In-Reply-To: <4404B738.4030202@ichips.intel.com>
References: <200602281924.33041.jackm@mellanox.co.il> <4404B738.4030202@ichips.intel.com>
Message-ID: <20060228220225.GH21824@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [PATCH] mad: add GID/class checking for matching received to sent MADs
>
> Jack Morgenstein wrote:
> >+static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr,
> >+				   struct ib_mad_recv_wc *rwc )
> >+{
> >+	struct ib_ah_attr attr;
> >+	u8 send_resp, rcv_resp;
> >+
> >+	send_resp = ((struct ib_mad *)(wr->send_buf.mad))->
> >+		mad_hdr.method & IB_MGMT_METHOD_RESP;
> >+	rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP;
> >+
> >+	if (!send_resp && rcv_resp)
> >+		/* is request/response. GID/LIDs are both local (same). */
> >+		return 1;
> >+
> >+	if (send_resp == rcv_resp)
> >+		/* both requests, or both responses. GIDs different */
> >+		return 0;
>
> The two checks above are only checking response bits of a sent and received
> MAD. How do these relate to checking the GIDs?

Well, the comments really explain what the checks do, don't they?
Would rcv_has_same_initiator be a better name?

> >+	if (ib_query_ah(wr->send_buf.ah, &attr))
> >+		/* Assume not equal, to avoid false positives. */
> >+		return 0;
>
> Querying for the address handle every time seems expensive.

It's a very cheap operation. Look at the implementation: the low level
driver already caches everything in hardware format, we just move
some bits around.

> Can't we save
> the necessary information inside the ah?

That would waste a lot of memory.

It's not worth caching it. Our locking rules require query_ah to be atomic,
which is a strong hint that it just retrieves the data cached by the low level
driver.

> Failing that, we should move the
> query when sending a MAD, so it can be done once.
We do it on receive to match ACKs to transactions.
I don't see how this check can be done on send.
Patch?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From mst at mellanox.co.il Tue Feb 28 14:11:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Mar 2006 00:11:00 +0200
Subject: [openib-general] Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1297805@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1297805@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <20060228221100.GI21824@mellanox.co.il>

Quoting r. Caitlin Bestler :
> The only trick is avoiding an extra layer of indirection
> for any fastpath operations. So we may want to keep all
> fastpath operations transport neutral, at least on a
> syntax basis, to avoid the extra dereference. I'm not
> sure there are any, though.

No, that's not the only way. You can just do

struct ib_ops {
	process_local_mad;
};

struct iw_ops {
	llp_connect;
};

struct ib_device {
	struct ib_ops ib;
	struct iw_ops iw;
}

And there's no runtime overhead to this - it's not an extra indirection.

You can even do:

struct ib_device {
	struct ib_ops;
	struct iw_ops;
}

To hide an extra structure from the user.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

From eitan at mellanox.co.il Tue Feb 28 14:11:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 1 Mar 2006 00:11:28 +0200
Subject: [openib-general] RE: Re: [PATCH] ib_mad: prevent duplicateoutstandingMADtransactions with same TID
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B84E@mtlexch01.mtl.com>

Hi Sean

>
> Correct - unfortunately, I don't believe that there's any way to know if an
> abort is for an RMPP message that is being sent, versus received. I.e. host A
> can send an RMPP message to host B with TID=3 at the same time host B sends an
> RMPP message to host A with TID=3.
If host B sends an abort, host A has no idea
> which transaction is being aborted.

[EZ] The spec actually lets you know:
C15-0.1.18: SA shall respond to SubnAdmGetTable() and SubnAdmGetTraceTable()
requests by performing the Sender role in a receiver initiated RMPP
transmission sequence as described in 13.6.6.1 Receiver-Initiated Transfer
on page 790.
* The method used for all packets sent from the Receiver to the Sender
  shall be SubnAdmGetTable() or SubnAdmGetTraceTable(), depending on
  which initiated the transfer.
* The method used for all packets sent from the Sender to the Receiver
  shall be SubnAdmGetTableResp().

From mshefty at ichips.intel.com Tue Feb 28 14:12:29 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 28 Feb 2006 14:12:29 -0800
Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs
In-Reply-To: <20060228213819.GF21824@mellanox.co.il>
References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int> <20060228210316.GE21824@mellanox.co.il> <1141161959.5319.52.camel@trinity.ogc.int> <20060228213819.GF21824@mellanox.co.il>
Message-ID: <4404CACD.9040505@ichips.intel.com>

Michael S. Tsirkin wrote:
> Okay, but let's try to avoid adding runtime overhead.

This shouldn't add any more runtime overhead than what's already there. The only
difference is the notation used to get to the process_mad function.

> I don't see how - the union size will likely change anyway.

This could be avoided by making the union reference pointers to structures,
rather than the structures themselves. I didn't go this route because I didn't
want to update where the ib_device structures were allocated. This would have
an impact on runtime performance, but the cost is only an additional memory read.

>>>My point is, I have to test whether the function is implemented anyway,
>>>so why add two checks: one for device type, another for function
>>>implementation?
It's complicated and inefficient.

Multiple checks are already done today. For example, the MAD code checks device
type before trying to use a device, then checks again to see if a function is
implemented. Similarly, the CMA must also check device type.

> Then at least let's make it a structure, not a union.
> This way a single test is sufficient to figure out whether
> a specific function is supported.

I don't follow the advantages of embedding a structure inside ib_device.

- Sean

From mst at mellanox.co.il Tue Feb 28 14:21:56 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Mar 2006 00:21:56 +0200
Subject: [openib-general] Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs
In-Reply-To: <4404CACD.9040505@ichips.intel.com>
References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int> <20060228210316.GE21824@mellanox.co.il> <1141161959.5319.52.camel@trinity.ogc.int> <20060228213819.GF21824@mellanox.co.il> <4404CACD.9040505@ichips.intel.com>
Message-ID: <20060228222155.GJ21824@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs
>
> Michael S. Tsirkin wrote:
> >Okay, but let's try to avoid adding runtime overhead.
>
> This shouldn't add any more runtime overhead than what's already there. The
> only difference is the notation used to get to the process_mad function.

The overhead will appear when you call the function, since you will have to
check the device type. I know it's small, but let's not set the wrong precedent.

Let's put it this way: we have e.g. in core/sysfs.c show_pma_counter:

	if (!p->ibdev->process_mad)
		return sprintf(buf, "N/A (no PMA)\n");

So I don't want to convert that to:

	if (p->ibdev->type != INFINIBAND || !p->ibdev->process_mad)
		return sprintf(buf, "N/A (no PMA)\n");

But I'm fine with:

	if (!p->ibdev->ib.process_mad)
		return sprintf(buf, "N/A (no PMA)\n");

This is faster, and safer.
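
A minimal, self-contained sketch of the embedded-structure layout argued for in this thread may help. Everything named `sketch_*` is hypothetical and is not the real ib_verbs.h definition; the single op per table and the `int port` parameter are placeholders chosen for brevity. The point it illustrates is that when both op tables are embedded in the device, an unsupported verb is just a NULL pointer, so one test covers both "wrong transport" and "not implemented", with no extra pointer dereference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical IB-specific operation table (placeholder for process_mad etc.). */
struct sketch_ib_ops {
	int (*process_mad)(int port);
};

/* Hypothetical iWARP-specific operation table (placeholder for the CM verbs). */
struct sketch_iw_ops {
	int (*llp_connect)(int port);
};

/* Both tables are embedded, not pointed to, so no extra indirection is needed. */
struct sketch_device {
	struct sketch_ib_ops ib;
	struct sketch_iw_ops iw;
};

/* Stand-in for a driver's MAD handler; simply echoes the port number. */
static int sketch_process_mad(int port)
{
	return port;
}

/* One NULL test decides whether the verb is available on this device. */
static int sketch_show_pma_counter(const struct sketch_device *dev,
				   char *buf, size_t len)
{
	assert(dev != NULL && buf != NULL);

	if (!dev->ib.process_mad)
		return snprintf(buf, len, "N/A (no PMA)\n");

	return snprintf(buf, len, "%d\n", dev->ib.process_mad(1));
}
```

A device that leaves `ib.process_mad` unset, as an iWARP device would in this sketch, takes the "N/A" branch without any separate device type check.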
-- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From bos at pathscale.com Tue Feb 28 14:21:48 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 28 Feb 2006 14:21:48 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <20060228130334.406acc52.weiny2@llnl.gov> References: <1140801374.1158.22.camel@camp4.serpentine.com> <1140802077.4336.34.camel@hal.voltaire.com> <1140803238.1158.42.camel@camp4.serpentine.com> <20060228130334.406acc52.weiny2@llnl.gov> Message-ID: <1141165308.20227.63.camel@serpentine.pathscale.com> On Tue, 2006-02-28 at 13:03 -0800, Ira Weiny wrote: > My vote is "in". Yes, that's the conclusion we've already reached. References: <200602281924.33041.jackm@mellanox.co.il> <4404B738.4030202@ichips.intel.com> <20060228220225.GH21824@mellanox.co.il> Message-ID: <4404CE24.3010907@ichips.intel.com> Michael S. Tsirkin wrote: >>>+static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, >>>+ struct ib_mad_recv_wc *rwc ) >>>+{ >>>+ struct ib_ah_attr attr; >>>+ u8 send_resp, rcv_resp; >>>+ >>>+ send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> >>>+ mad_hdr.method & IB_MGMT_METHOD_RESP; >>>+ rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; >>>+ >>>+ if (!send_resp && rcv_resp) >>>+ /* is request/response. GID/LIDs are both local (same). */ >>>+ return 1; >>>+ >>>+ if (send_resp == rcv_resp) >>>+ /* both requests, or both responses. GIDs different */ >>>+ return 0; >> >>The two checks above are only checking response bits of a sent and received >>MAD. How do these relate to checking the GIDs? > > > Well, the comments really explain what the checks do, don't they? > Would rcv_has_same_initiator be a better name? The comments don't help. Returning that the GIDs match just because a send is a request and a receive is a response is misleading. How does the function know this? >>>+ if (ib_query_ah(wr->send_buf.ah, &attr)) >>>+ /* Assume not equal, to avoid false positives. 
*/ >>>+ return 0; >> >>Querying for the address handle every time seems expensive. > > > Its a very cheap operation. Look at the implementation: the low level > driver already caches everything in hardware format, we just move > some bits around. > > >>Can't we save >>the necessary information inside the ah? > > > That would waste a lot of memory. If the driver is caching the memory anyway, couldn't it just store the data in the ah, rather than some private structure? > Its not worth caching it. Our locking rules require for query_ah to be atomic > which is a strong hint that it just retrieves the data cached by low level > driver. If the locking rules require the data to be cached by the driver, why not make it available to the user directly? > We do it on receive to match ACKs to transactions. > I dont see how this check can be done on send. > Patch? I was suggesting that the query be done when sending the MAD and the results saved. This makes less sense if the ah attributes need to be cached by the driver anyway. - Sean From mst at mellanox.co.il Tue Feb 28 14:39:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Mar 2006 00:39:09 +0200 Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs In-Reply-To: <4404CE24.3010907@ichips.intel.com> References: <200602281924.33041.jackm@mellanox.co.il> <4404B738.4030202@ichips.intel.com> <20060228220225.GH21824@mellanox.co.il> <4404CE24.3010907@ichips.intel.com> Message-ID: <20060228223909.GL21824@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] mad: add GID/class checking for matching received to sent MADs > > Michael S. 
Tsirkin wrote: > >>>+static inline int rcv_has_same_gid(struct ib_mad_send_wr_private *wr, > >>>+ struct ib_mad_recv_wc *rwc ) > >>>+{ > >>>+ struct ib_ah_attr attr; > >>>+ u8 send_resp, rcv_resp; > >>>+ > >>>+ send_resp = ((struct ib_mad *)(wr->send_buf.mad))-> > >>>+ mad_hdr.method & IB_MGMT_METHOD_RESP; > >>>+ rcv_resp = rwc->recv_buf.mad->mad_hdr.method & IB_MGMT_METHOD_RESP; > >>>+ > >>>+ if (!send_resp && rcv_resp) > >>>+ /* is request/response. GID/LIDs are both local (same). */ > >>>+ return 1; > >>>+ > >>>+ if (send_resp == rcv_resp) > >>>+ /* both requests, or both responses. GIDs different */ > >>>+ return 0; > >> > >>The two checks above are only checking response bits of a sent and > >>received MAD. How do these relate to checking the GIDs? > > > > > >Well, the comments really explain what the checks do, don't they? > >Would rcv_has_same_initiator be a better name? > > The comments don't help. Returning that the GIDs match just because a send > is a request and a receive is a response is misleading. How does the > function know this? We are comparing the GIDs of the requestor, remember? I initiated one of the MADs and remote side initiated the other one so I know they cant match. > >>>+ if (ib_query_ah(wr->send_buf.ah, &attr)) > >>>+ /* Assume not equal, to avoid false positives. */ > >>>+ return 0; > >> > >>Querying for the address handle every time seems expensive. > > > > > >Its a very cheap operation. Look at the implementation: the low level > >driver already caches everything in hardware format, we just move > >some bits around. > > > > > >>Can't we save > >>the necessary information inside the ah? > > > > > >That would waste a lot of memory. > > If the driver is caching the memory anyway, couldn't it just store the data > in the ah, rather than some private structure? No, because its keeping it in hardware format. > >Its not worth caching it. 
Our locking rules require for query_ah to be > >atomic > >which is a strong hint that it just retrieves the data cached by low level > >driver. > > If the locking rules require the data to be cached by the driver, why not > make it available to the user directly? They dont actually require it, see. But thats what drivers happen to do. > >We do it on receive to match ACKs to transactions. > >I dont see how this check can be done on send. > >Patch? > > I was suggesting that the query be done when sending the MAD and the > results saved. I dont think this can work. > This makes less sense if the ah attributes need to be > cached by the driver anyway. Right. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies From caitlinb at broadcom.com Tue Feb 28 14:40:28 2006 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 28 Feb 2006 14:40:28 -0800 Subject: [openib-general] RE: Re: [PATCH] RFC Verbs: add support for transport specific verbs Message-ID: <54AD0F12E08D1541B826BE97C98F99F1297820@NT-SJCA-0751.brcm.ad.broadcom.com> Michael S. Tsirkin wrote: > Quoting r. Caitlin Bestler : >> The only trick is avoiding an extra layer of indirection for any >> fastpath operations. So we may want to keep all fastpath operations >> transport neutral, at least on a syntax basis, to avoid the extra >> dereference. I'm not sure there are any, though. > > No, thats not the only way. You can just do > > struct ib_ops { > process_local_mad; > }; > > struct iw_ops { > llp_connect; > }; > > struct ib_device { > struct ib_ops ib; > struct iw_ops iw; > } > > And there's no runtime overhead to this - its not an extra > indirection. > That works, as long as we accept that all "struct transport_ops" will be defined prior to this. That means that a change to struct iw_ops will still trigger a make of InfiniBand specific code. And (sizeof struct ib_device) will change. Since we are talking years per new transport rather than new transports per year this is probably acceptable. 
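The embedded-structure layout discussed in this thread can be sketched in plain C. This is only an illustration under stated assumptions: the `demo_*` names are hypothetical stand-ins, not the kernel's definitions, and the real `struct ib_device` carries many more members. It shows why a single NULL test is enough to decide whether a transport-specific verb is available, with no separate device-type check.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures discussed in
 * this thread; none of these names exist in the kernel. */
struct demo_device;

struct demo_ib_ops {
	int (*process_mad)(struct demo_device *dev, int port_num);
};

struct demo_iw_ops {
	int (*llp_connect)(struct demo_device *dev);
};

struct demo_device {
	const char *name;
	struct demo_ib_ops ib;	/* embedded by value: no extra pointer to follow */
	struct demo_iw_ops iw;
};

/* One NULL test answers "is this verb supported?": a device of another
 * transport simply leaves ib.process_mad unset. */
static int demo_has_process_mad(const struct demo_device *dev)
{
	return dev->ib.process_mad != NULL;
}

/* Placeholder driver callback; it just echoes the port number. */
static int demo_ib_process_mad(struct demo_device *dev, int port_num)
{
	(void)dev;
	return port_num;
}
```

The trade-off raised above still applies to this sketch: because the ops structures are embedded by value, adding a member to `demo_iw_ops` changes `sizeof(struct demo_device)` and forces a rebuild of code that only cares about the IB side.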
From mshefty at ichips.intel.com Tue Feb 28 14:44:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 14:44:35 -0800 Subject: [openib-general] Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <20060228222155.GJ21824@mellanox.co.il> References: <20060228192056.GH21556@mellanox.co.il> <4404A980.2040609@ichips.intel.com> <1141159720.5319.41.camel@trinity.ogc.int> <20060228210316.GE21824@mellanox.co.il> <1141161959.5319.52.camel@trinity.ogc.int> <20060228213819.GF21824@mellanox.co.il> <4404CACD.9040505@ichips.intel.com> <20060228222155.GJ21824@mellanox.co.il> Message-ID: <4404D253.2000908@ichips.intel.com> Michael S. Tsirkin wrote: > The overhead will appear when you call the function, since you > will have to check the device type. I know its small, but lets not > set the wrong precedent. > > Lets put it this way: we have e.g. in core/sysfs.c show_pma_counter: I would think that sysfs would need to be updated to check for device types anyway. The PMA is IB specific, as are several of the attributes: LID, GID, GUID, SM SL, etc. I'm not sure how often we'll be in transport neutral code with the need to call a transport specific function without first checking the transport type. The _right_ precedence may be to force the user to check the transport type before doing so. Whether a function is optional for a given transport is up to each transport to decide. At this point, I'm still undecided on what approach to use. 
- Sean From mshefty at ichips.intel.com Tue Feb 28 15:21:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 15:21:29 -0800 Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs In-Reply-To: <20060228223909.GL21824@mellanox.co.il> References: <200602281924.33041.jackm@mellanox.co.il> <4404B738.4030202@ichips.intel.com> <20060228220225.GH21824@mellanox.co.il> <4404CE24.3010907@ichips.intel.com> <20060228223909.GL21824@mellanox.co.il> Message-ID: <4404DAF9.7070807@ichips.intel.com> Michael S. Tsirkin wrote: >>If the driver is caching the memory anyway, couldn't it just store the data >>in the ah, rather than some private structure? > > No, because its keeping it in hardware format. Is there a particular reason it needs it cached in this format? - Sean From mshefty at ichips.intel.com Tue Feb 28 15:43:27 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 15:43:27 -0800 Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs In-Reply-To: <200602281924.33041.jackm@mellanox.co.il> References: <200602281924.33041.jackm@mellanox.co.il> Message-ID: <4404E01F.4080100@ichips.intel.com> Jack Morgenstein wrote: > Adds GID and class checking to mad receive processing when locating sent MAD. Committed in 5533. Thanks for updating this! 
- Sean From mshefty at ichips.intel.com Tue Feb 28 15:53:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 15:53:18 -0800 Subject: [openib-general] Re: [PATCH] ib_mad: prevent duplicate outstanding MAD transactions with same TID In-Reply-To: <20060228190251.GE21556@mellanox.co.il> References: <20060227181156.GC17414@mellanox.co.il> <4403422F.2000408@ichips.intel.com> <20060227182746.GB19265@mellanox.co.il> <44034CDF.9030204@ichips.intel.com> <20060227194215.GC20064@mellanox.co.il> <4403575F.1010305@ichips.intel.com> <20060228130819.GJ19855@mellanox.co.il> <44048F73.9040108@ichips.intel.com> <20060228182038.GC21556@mellanox.co.il> <440495F3.9040701@ichips.intel.com> <20060228190251.GE21556@mellanox.co.il> Message-ID: <4404E26E.1060700@ichips.intel.com> Michael S. Tsirkin wrote: > 2. Assume vendor specific behaves in the same way as SA class, and > ask users to adhere to this rule The current RMPP code actually applies this rule. The response bit is toggled when sending an ACK/STOP/ABORT. > Further if you are going to work on a spec extension, it could simply add the > requirement on the resp bit for vendor specific classes. Right? Correct. Copying Hal on this message, since he's bringing up the issue with the IBTA. > Sure, but again, if you initiate a request and then abort it, you > clear the response bit, if you are receiving a request and decide to abort it, > you set the response bit. > > Therefore if you get an abort you can look at the resp bit: if it is > set this is a transaction that you initiated, if it is clear this is > a transaction that remote side initiated. > > I conclude that there's no ambiguity. Am I mistaken? I believe that you're correct, but I need to consider this more with respect to receiving a duplicated request, while a response is being generated.
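The response-bit rule quoted above can be written down as a small helper. This is a rough sketch, not the RMPP code itself: the `demo_*` names are hypothetical, and it assumes the response bit is the 0x80 bit of the MAD method field (the value the kernel headers call `IB_MGMT_METHOD_RESP`). It does not address the duplicated-request case that is still open.

```c
#include <stdint.h>

/* Response bit of the MAD method field; assumed to be 0x80, matching
 * IB_MGMT_METHOD_RESP. The DEMO_ name is hypothetical. */
#define DEMO_METHOD_RESP 0x80

enum demo_initiator {
	DEMO_INITIATED_BY_REMOTE = 0,
	DEMO_INITIATED_BY_US = 1,
};

/* Classify a received ABORT by the rule quoted above: the initiator of
 * a transaction sends with the response bit clear, the responder sends
 * with it set. An ABORT that arrives with the bit set was therefore
 * sent by the responder, so it refers to a transaction we initiated. */
static enum demo_initiator demo_abort_initiator(uint8_t method)
{
	return (method & DEMO_METHOD_RESP) ? DEMO_INITIATED_BY_US
					   : DEMO_INITIATED_BY_REMOTE;
}
```

For example, assuming the usual SA method encodings, an ABORT carrying SubnAdmGetTableResp (0x92) would map to a locally initiated transaction, and one carrying SubnAdmGetTable (0x12) to a remotely initiated one.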
- Sean From sean.hefty at intel.com Tue Feb 28 16:23:14 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 16:23:14 -0800 Subject: [openib-general] Re: Re: [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <20060228221100.GI21824@mellanox.co.il> Message-ID: Here's a version that just embeds the structures, rather than creating a union. Signed-off-by: Sean Hefty --- After we decide on which approach to use, we can determine if other calls should be separated out, such as query_pkey. Roland/Hal, do you have a particular preference? Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 5532) +++ include/rdma/ib_verbs.h (working copy) @@ -824,6 +824,16 @@ struct ib_cache { struct ib_gid_cache **gid_cache; }; +struct ib_ops { + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +}; + struct ib_device { struct device *dma_device; @@ -954,13 +964,8 @@ struct ib_device { int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); - int (*process_mad)(struct ib_device *device, - int process_mad_flags, - u8 port_num, - struct ib_wc *in_wc, - struct ib_grh *in_grh, - struct ib_mad *in_mad, - struct ib_mad *out_mad); + + struct ib_ops ibop; struct module *owner; struct class_device class_dev; Index: core/mad.c =================================================================== --- core/mad.c (revision 5533) +++ core/mad.c (working copy) @@ -704,9 +704,9 @@ static int handle_outgoing_dr_smp(struct send_wr->wr.ud.port_num, &mad_wc); /* No GRH for DR SMP */ - ret = device->process_mad(device, 0, port_num, &mad_wc, NULL, - (struct ib_mad *)smp, - (struct ib_mad *)&mad_priv->mad); + ret = device->ibop.process_mad(device, 0, port_num, &mad_wc, NULL, + (struct ib_mad *) smp, + (struct ib_mad *) &mad_priv->mad); switch 
(ret) { case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: @@ -1835,7 +1835,7 @@ static void ib_mad_recv_done_handler(str local: /* Give driver "right of first refusal" on incoming MAD */ - if (port_priv->device->process_mad) { + if (port_priv->device->ibop.process_mad) { int ret; if (!response) { @@ -1847,11 +1847,11 @@ local: goto out; } - ret = port_priv->device->process_mad(port_priv->device, 0, - port_priv->port_num, - wc, &recv->grh, - &recv->mad.mad, - &response->mad.mad); + ret = port_priv->device->ibop.process_mad(port_priv->device, 0, + port_priv->port_num, + wc, &recv->grh, + &recv->mad.mad, + &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_CONSUMED) goto out; Index: core/sysfs.c =================================================================== --- core/sysfs.c (revision 5532) +++ core/sysfs.c (working copy) @@ -311,7 +311,7 @@ static ssize_t show_pma_counter(struct i struct ib_mad *out_mad = NULL; ssize_t ret; - if (!p->ibdev->process_mad) + if (!p->ibdev->ibop.process_mad) return sprintf(buf, "N/A (no PMA)\n"); in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); @@ -329,7 +329,7 @@ static ssize_t show_pma_counter(struct i in_mad->data[41] = p->port_num; /* PortSelect field */ - if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, + if ((p->ibdev->ibop.process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, NULL, NULL, in_mad, out_mad) & (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { Index: core/smi.h =================================================================== --- core/smi.h (revision 5532) +++ core/smi.h (working copy) @@ -58,7 +58,7 @@ static inline int smi_check_local_smp(st { /* C14-9:3 -- We're at the end of the DR segment of path */ /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ - return ((device->process_mad && + return ((device->ibop.process_mad && !ib_get_smp_direction(smp) && (smp->hop_ptr == smp->hop_cnt + 1))); } Index: 
hw/ipath/ipath_verbs.c =================================================================== --- hw/ipath/ipath_verbs.c (revision 5532) +++ hw/ipath/ipath_verbs.c (working copy) @@ -5949,7 +5949,7 @@ static int ipath_register_ib_device(cons dev->dealloc_fmr = ipath_dealloc_fmr; dev->attach_mcast = ipath_multicast_attach; dev->detach_mcast = ipath_multicast_detach; - dev->process_mad = ipath_process_mad; + dev->ibop.process_mad = ipath_process_mad; ret = ib_register_device(dev); if (ret) Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 5532) +++ hw/mthca/mthca_provider.c (working copy) @@ -1329,7 +1329,7 @@ int mthca_register_device(struct mthca_d dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; - dev->ib_dev.process_mad = mthca_process_mad; + dev->ib_dev.ibop.process_mad = mthca_process_mad; if (mthca_is_memfree(dev)) { dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; From yates2 at llnl.gov Tue Feb 28 16:24:28 2006 From: yates2 at llnl.gov (Kim Yates) Date: Tue, 28 Feb 2006 16:24:28 -0800 Subject: [openib-general] Suggested components to support in 1.0 In-Reply-To: <35EA21F54A45CB47B879F21A91F4862FACFDCB@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862FACFDCB@taurus.voltaire.com> Message-ID: If iSER will be in pretty good shape, I would like to see it included. 
On Feb 24, 2006, at 9:54 AM, Yaron Haviv wrote: > > >> -----Original Message----- >> From: openib-general-bounces at openib.org [mailto:openib-general- >> bounces at openib.org] On Behalf Of Bob Woodruff >> Sent: Friday, February 24, 2006 12:23 PM >> To: 'Bryan O'Sullivan'; openib-general >> Subject: RE: [openib-general] Suggested components to support in 1.0 >> >> Bryan wrote, >>> Components that I don't know what to do about, and will likely want > to >>> drop unless someone can vouch for them: >> >>> * iSER >>> * SRP >>> * uDAPL >> >> >> We need uDAPL and I am sure people want SRP and >> I think both are in good shape. >> I am not sure that iSer is quite ready, but will let Voltaire make > that >> call. >> >> woody > > Woody, I believe that OpenIB iSER is quickly getting there with the > amount of dedicated work Or, Dan, and others put into it > > We would definitely vote for it > > Yaron ________________________________________________________ Kim Yates LLNL Center for Applied Scientific Computing kimyates at llnl.gov (925) 424-3642 From swise at opengridcomputing.com Tue Feb 28 19:25:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 28 Feb 2006 19:25:31 -0800 Subject: [openib-general] cmatose questions In-Reply-To: <44048C56.3000708@ichips.intel.com> References: <20060228165150.GY19855@mellanox.co.il> <44048C56.3000708@ichips.intel.com> Message-ID: <1141183531.6890.6.camel@stevo-laptop> On Tue, 2006-02-28 at 09:45 -0800, Sean Hefty wrote: > Michael S. Tsirkin wrote: > > I then unloaded the client and loaded it again: > > modprobe rdma_cmatose dst_ip=11.4.8.155; rmmod rdma_cmatose > > The kernel dst_ip isn't this smart. You'll need to provide the IP address as: > > x9b08040b > > Note that there's a userspace version of cmatose in librdmacm that's easier to > interface to, and can handle normal IP addresses and host names as input. 
> Here's a patch that allows you to specify the ip addresses in dotted decimal format for the kernel version of cmatose. Signed-off-by: Steve Wise Index: cmatose.c =================================================================== --- cmatose.c (revision 5533) +++ cmatose.c (working copy) @@ -40,6 +40,7 @@ #include #include #include +#include #include @@ -49,8 +50,8 @@ /* * To execute: - * Server: insmod rdma_cmatose.ko "src_ip=ip" - * Client: insmod rdma_cmatose.ko ["src_ip=ip"] "dst_ip=ip" + * Server: insmod rdma_cmatose.ko "src_ip=x.x.x.x" + * Client: insmod rdma_cmatose.ko ["src_ip=x.x.x.x"] "dst_ip=y.y.y.y" */ struct cmatest_node { @@ -85,14 +86,14 @@ }; static struct cmatest test; -static int src_ip = 0; -static int dst_ip = 0; +static char *src_ip = "000.000.000.000"; +static char *dst_ip = "x00.000.000.000"; static int connections = 1; static int message_size = 100; static int message_count = 0; -module_param(src_ip, int, 0444); -module_param(dst_ip, int, 0444); +module_param(src_ip, charp, 0444); +module_param(dst_ip, charp, 0444); module_param(connections, int, 0444); module_param(message_size, int, 0444); module_param(message_count, int, 0444); @@ -411,7 +412,7 @@ for (i = 0; i < connections; i++) { test.nodes[i].id = i; - if (dst_ip) { + if (dst_ip[0] != 'x') { test.nodes[i].cma_id = rdma_create_id(cma_handler, &test.nodes[i], RDMA_PS_TCP); @@ -448,13 +449,13 @@ test.src_in.sin_family = AF_INET; test.src_in.sin_port = 7471; if (src_ip) - test.src_in.sin_addr.s_addr = src_ip; + test.src_in.sin_addr.s_addr = in_aton(src_ip); test.src_addr = (struct sockaddr *) &test.src_in; - if (dst_ip) { + if (dst_ip[0] != 'x') { test.dst_in.sin_family = AF_INET; test.dst_in.sin_port = 7471; - test.dst_in.sin_addr.s_addr = dst_ip; + test.dst_in.sin_addr.s_addr = in_aton(dst_ip); } test.dst_addr = (struct sockaddr *) &test.dst_in; } @@ -594,7 +595,7 @@ if (ret) return; - if (dst_ip) + if (dst_ip[0] != 'x') run_client(); else run_server(); From jgunthorpe at 
obsidianresearch.com Tue Feb 28 17:28:29 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 28 Feb 2006 18:28:29 -0700 Subject: [openib-general] [PATCH] Minor fix for ibnetdiscover Message-ID: <20060301012829.GA18950@obsidianresearch.com> Trivial, ibnetdiscover does not display some switches for some topologies. This seems to come up if you connect a CA to a switch but nothing else to the switch. Output before: # Topology file: generated on Tue Feb 28 18:21:30 2006 # # Max of 1 hops discovered # Initiated from node 0002c9020020d654 port 0002c9020020d655 devid=0x6274 caguids=0x2c9020020d654 Ca 1 "H-0002c9020020d654" # MT25204 InfiniHostLx Mellanox Technologies [1] "S-0004538030003670"[1] # lid 3 lmc 0 After: # # Topology file: generated on Tue Feb 28 18:22:03 2006 # # Max of 1 hops discovered # Initiated from node 0002c9020020d654 port 0002c9020020d655 devid=0x0 switchguids=0x4538030003670 Switch 2 "S-0004538030003670" # Longbow-XR Prototype - Obsidian Research Corporation port 0 lid 4 [1] "H-0002c9020020d654"[1] devid=0x6274 caguids=0x2c9020020d654 Ca 1 "H-0002c9020020d654" # MT25204 InfiniHostLx Mellanox Technologies [1] "S-0004538030003670"[1] # lid 3 lmc 0 --- ibnetdiscover.c.old 2006-02-15 17:10:34.000000000 -0700 +++ ibnetdiscover.c 2006-02-28 18:13:16.000000000 -0700 @@ -532,7 +532,7 @@ #endif } - for (dist = 0; dist < maxhops_discovered; dist++) { + for (dist = 0; dist <= maxhops_discovered; dist++) { for (node = nodesdist[dist]; node; node = node->dnext) { From swise at opengridcomputing.com Tue Feb 28 19:30:15 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 28 Feb 2006 19:30:15 -0800 Subject: [openib-general] [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: References: Message-ID: <1141183815.6890.8.camel@stevo-laptop> Aren't attach_mcast and detach_mcast also ib-specific? 
On Tue, 2006-02-28 at 11:06 -0800, Sean Hefty wrote: > Add support for transport specific extensions to the ib_device verbs. > Relocate process_mad as an IB specific verb. > > This provides a mechanism to add iWarp specific functionality, such as > the iWarp CM calls, to ib_device. > > Signed-off-by: Sean Hefty > > --- > > Index: include/rdma/ib_verbs.h > =================================================================== > --- include/rdma/ib_verbs.h (revision 5532) > +++ include/rdma/ib_verbs.h (working copy) > @@ -824,6 +824,16 @@ struct ib_cache { > struct ib_gid_cache **gid_cache; > }; > > +struct ib_verbs { > + int (*process_mad)(struct ib_device *device, > + int process_mad_flags, > + u8 port_num, > + struct ib_wc *in_wc, > + struct ib_grh *in_grh, > + struct ib_mad *in_mad, > + struct ib_mad *out_mad); > +}; > + > struct ib_device { > struct device *dma_device; > > @@ -954,13 +964,10 @@ struct ib_device { > int (*detach_mcast)(struct ib_qp *qp, > union ib_gid *gid, > u16 lid); > - int (*process_mad)(struct ib_device *device, > - int process_mad_flags, > - u8 port_num, > - struct ib_wc *in_wc, > - struct ib_grh *in_grh, > - struct ib_mad *in_mad, > - struct ib_mad *out_mad); > + > + union { > + struct ib_verbs ib; > + } ext_verbs; > > struct module *owner; > struct class_device class_dev; > Index: core/mad.c > =================================================================== > --- core/mad.c (revision 5532) > +++ core/mad.c (working copy) > @@ -704,9 +704,9 @@ static int handle_outgoing_dr_smp(struct > send_wr->wr.ud.port_num, &mad_wc); > > /* No GRH for DR SMP */ > - ret = device->process_mad(device, 0, port_num, &mad_wc, NULL, > - (struct ib_mad *)smp, > - (struct ib_mad *)&mad_priv->mad); > + ret = device->ext_verbs.ib.process_mad(device, 0, port_num, &mad_wc, > + NULL, (struct ib_mad *)smp, > + (struct ib_mad *)&mad_priv->mad); > switch (ret) > { > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: > @@ -1787,7 +1787,7 @@ static void 
ib_mad_recv_done_handler(str > > local: > /* Give driver "right of first refusal" on incoming MAD */ > - if (port_priv->device->process_mad) { > + if (port_priv->device->ext_verbs.ib.process_mad) { > int ret; > > if (!response) { > @@ -1799,11 +1799,11 @@ local: > goto out; > } > > - ret = port_priv->device->process_mad(port_priv->device, 0, > - port_priv->port_num, > - wc, &recv->grh, > - &recv->mad.mad, > - &response->mad.mad); > + ret = port_priv->device-> > + ext_verbs.ib.process_mad(port_priv->device, 0, > + port_priv->port_num, wc, > + &recv->grh, &recv->mad.mad, > + &response->mad.mad); > if (ret & IB_MAD_RESULT_SUCCESS) { > if (ret & IB_MAD_RESULT_CONSUMED) > goto out; > Index: core/sysfs.c > =================================================================== > --- core/sysfs.c (revision 5532) > +++ core/sysfs.c (working copy) > @@ -311,7 +311,7 @@ static ssize_t show_pma_counter(struct i > struct ib_mad *out_mad = NULL; > ssize_t ret; > > - if (!p->ibdev->process_mad) > + if (!p->ibdev->ext_verbs.ib.process_mad) > return sprintf(buf, "N/A (no PMA)\n"); > > in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); > @@ -329,7 +329,7 @@ static ssize_t show_pma_counter(struct i > > in_mad->data[41] = p->port_num; /* PortSelect field */ > > - if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, > + if ((p->ibdev->ext_verbs.ib.process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, > p->port_num, NULL, NULL, in_mad, out_mad) & > (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != > (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { > Index: core/smi.h > =================================================================== > --- core/smi.h (revision 5532) > +++ core/smi.h (working copy) > @@ -58,7 +58,7 @@ static inline int smi_check_local_smp(st > { > /* C14-9:3 -- We're at the end of the DR segment of path */ > /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ > - return ((device->process_mad && > + return ((device->ext_verbs.ib.process_mad && > !ib_get_smp_direction(smp) && 
> (smp->hop_ptr == smp->hop_cnt + 1))); > } > Index: hw/ipath/ipath_verbs.c > =================================================================== > --- hw/ipath/ipath_verbs.c (revision 5532) > +++ hw/ipath/ipath_verbs.c (working copy) > @@ -5949,7 +5949,7 @@ static int ipath_register_ib_device(cons > dev->dealloc_fmr = ipath_dealloc_fmr; > dev->attach_mcast = ipath_multicast_attach; > dev->detach_mcast = ipath_multicast_detach; > - dev->process_mad = ipath_process_mad; > + dev->ext_verbs.ib.process_mad = ipath_process_mad; > > ret = ib_register_device(dev); > if (ret) > Index: hw/mthca/mthca_provider.c > =================================================================== > --- hw/mthca/mthca_provider.c (revision 5532) > +++ hw/mthca/mthca_provider.c (working copy) > @@ -1329,7 +1329,7 @@ int mthca_register_device(struct mthca_d > > dev->ib_dev.attach_mcast = mthca_multicast_attach; > dev->ib_dev.detach_mcast = mthca_multicast_detach; > - dev->ib_dev.process_mad = mthca_process_mad; > + dev->ib_dev.ext_verbs.ib.process_mad = mthca_process_mad; > > if (mthca_is_memfree(dev)) { > dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Tue Feb 28 17:56:35 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 17:56:35 -0800 Subject: [openib-general] [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: <1141183815.6890.8.camel@stevo-laptop> References: <1141183815.6890.8.camel@stevo-laptop> Message-ID: <4404FF53.5090802@ichips.intel.com> Steve Wise wrote: > Aren't attach_mcast and detach_mcast also ib-specific? At this point, yes. There are a couple more calls that are IB specific as well. 
The intent is to define the framework that we want to use to define transport specific calls. Once that's defined, we can relocate other calls to the transport specific area through a series of patches. - Sean From mshefty at ichips.intel.com Tue Feb 28 18:06:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 18:06:29 -0800 Subject: [openib-general] [PATCH] RFC Verbs: add support for transport specific verbs In-Reply-To: References: Message-ID: <440501A5.8050306@ichips.intel.com> Sean Hefty wrote: > Add support for transport specific extensions to the ib_device verbs. > Relocate process_mad as an IB specific verb. > > This provides a mechanism to add iWarp specific functionality, such as > the iWarp CM calls, to ib_device. Third proposal from Fab: struct rdma_device { most everything }; struct ib_device { struct rdma_device dev; IB specific operations; }; Of course, this is a much larger change. - Sean From mshefty at ichips.intel.com Tue Feb 28 18:08:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Feb 2006 18:08:43 -0800 Subject: [openib-general] cmatose questions In-Reply-To: <1141183531.6890.6.camel@stevo-laptop> References: <20060228165150.GY19855@mellanox.co.il> <44048C56.3000708@ichips.intel.com> <1141183531.6890.6.camel@stevo-laptop> Message-ID: <4405022B.50009@ichips.intel.com> Steve Wise wrote: > Here's a patch that allows you to specify the ip addresses in dotted > decimal format for the kernel version of cmatose. Thanks! Committed in 5534. 
- Sean From devesh28 at gmail.com Tue Feb 28 20:42:05 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Wed, 1 Mar 2006 10:12:05 +0530 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <1141151786.4335.6624.camel@hal.voltaire.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> <1141128535.4335.4148.camel@hal.voltaire.com> <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> <1141151786.4335.6624.camel@hal.voltaire.com> Message-ID: <309a667c0602282042r3094c92ey6cab429229cb5e0b@mail.gmail.com> On 28 Feb 2006 13:36:27 -0500, Hal Rosenstock wrote: > > Hi Devesh, > > On Tue, 2006-02-28 at 11:12, Devesh Sharma wrote: > > Hi Hal, thanks for replying. > > This satisfies my needs if the user can define his own GUID-to-LID mapping. > > Can the user define his own GUID-to-LID mapping in this file? > > To my knowledge it's not used that way in general, but that could work if > consistent with the OpenSM LID policy (e.g. LMC, etc.). What does "consistent with the OpenSM LID policy" mean? Is it that each GUID should have a unique LID range? > The format of the file is as follows: > > 0x0008f10403960985 0x0007 0x0007 > > 0x0008f10400410015 0x0003 0x0003 > > (i.e., GUID, min LID, max LID, so the above is for LMC 0, which is the > default). > > -- Hal > > > Devesh > > > > On 28 Feb 2006 07:19:35 -0500, Hal Rosenstock wrote: > > Hi Devesh, > > > > On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: > > > Hi list, > > > Could anybody please brief me on the LID assignment policy used by the opensm > > > subnet manager? Can a user specify fixed LID mappings using a file? > > > > There is a file it creates with these in it so they can be reused > > subsequently. It is /var/cache/osm/guid2lid.
> > > > opensm -h has the following option: > > -c > > --cache-options > > Cache the given command line options into the file > > /var/cache/osm/opensm.opts for use next invocation > > The cache directory can be changed by the environment > > variable OSM_CACHE_DIR > > > > Is that suitable for your needs? > > > > -- Hal > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > From halr at voltaire.com Tue Feb 28 22:09:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Mar 2006 01:09:28 -0500 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <309a667c0602282042r3094c92ey6cab429229cb5e0b@mail.gmail.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> <1141128535.4335.4148.camel@hal.voltaire.com> <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> <1141151786.4335.6624.camel@hal.voltaire.com> <309a667c0602282042r3094c92ey6cab429229cb5e0b@mail.gmail.com> Message-ID: <1141193360.4335.9776.camel@hal.voltaire.com> On Tue, 2006-02-28 at 23:42, Devesh Sharma wrote: > > On 28 Feb 2006 13:36:27 -0500, Hal Rosenstock wrote: > Hi Devesh, > > On Tue, 2006-02-28 at 11:12, Devesh Sharma wrote: > > Hi Hal, thanks for replying. > > This satisfies my needs if the user can define his own guid to lid mapping. > > Can the user define his own guid to lid mapping in this file? > > To my knowledge it's not used that way in general but that could work if > consistent with the OpenSM LID policy (e.g. LMC, etc.). > > What does "consistent with the OpenSM LID policy" mean?
Does it mean that each GUID should have a unique LID range? Yes. The ranges must also match the LMC (min/max LID), with no duplicates or overlaps. > The format of the file is as follows: > > 0x0008f10403960985 0x0007 0x0007 > > 0x0008f10400410015 0x0003 0x0003 > > (e.g. GUID, min LID, max LID, so the above is for LMC 0, which is the > default). > > -- Hal > > > Devesh > > > > On 28 Feb 2006 07:19:35 -0500, Hal Rosenstock <halr at voltaire.com> > > wrote: > > Hi Devesh, > > > > On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: > > > Hi list, > > > Could anybody please brief me on the LID assignment policy used by the opensm > > > subnet manager? Can a user specify fixed LID mappings using a file? > > > > There is a file it creates with these in it so they can be reused > > subsequently. It is /var/cache/osm/guid2lid. > > > > opensm -h has the following option: > > -c > > --cache-options > > Cache the given command line options into the file > > /var/cache/osm/opensm.opts for use next invocation > > The cache directory can be changed by the environment > > variable OSM_CACHE_DIR > > > > Is that suitable for your needs?
> > > > -- Hal > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > From eitan at mellanox.co.il Tue Feb 28 23:20:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 01 Mar 2006 09:20:23 +0200 Subject: [openib-general] LID assignment policy of opensm In-Reply-To: <309a667c0602282042r3094c92ey6cab429229cb5e0b@mail.gmail.com> References: <309a667c0602272244re7134ees6031bc11883372ea@mail.gmail.com> <1141128535.4335.4148.camel@hal.voltaire.com> <309a667c0602280812s6da1cc99s30f10b7ea940f82e@mail.gmail.com> <1141151786.4335.6624.camel@hal.voltaire.com> <309a667c0602282042r3094c92ey6cab429229cb5e0b@mail.gmail.com> Message-ID: <44054B37.4020307@mellanox.co.il>

Hi Devesh,

If a user writes the file such that it is legal (see below), OpenSM will use it.

When a STANDBY SM gets mastership it has two choices:
a. Use the LIDs from the fabric
b. Enforce the assignments provided by the guid2lid file
The option that selects between them is honor_guid2lid_file (see the opensm.opts excerpt below).

Legal means:
1. there are no LID conflicts
2. lid range start is aligned with 2^(LMC)
NOTE that the end of the range can extend beyond start + 2^(LMC) - 1

/var/cache/osm/opensm.opts:
# If true honor the guid2lid file when coming out of standby
# state, if such file exists and is valid
honor_guid2lid_file FALSE

Eitan

Devesh Sharma wrote: > On 28 Feb 2006 13:36:27 -0500, Hal Rosenstock wrote: > >>Hi Devesh, >> >>On Tue, 2006-02-28 at 11:12, Devesh Sharma wrote: >> >>>Hi Hal, thanks for replying. >>>This satisfies my needs if the user can define his own guid to lid mapping. >>>Can the user define his own guid to lid mapping in this file?
>> >>To my knowledge it's not used that way in general but that could work if >>consistent with the OpenSM LID policy (e.g. LMC, etc.). > > > > What does "consistent with the OpenSM LID policy" mean? Does it mean that each GUID > should have a unique LID range? > > > The format of the file is as follows: > >>0x0008f10403960985 0x0007 0x0007 >> >>0x0008f10400410015 0x0003 0x0003 >> >>(e.g. GUID, min LID, max LID, so the above is for LMC 0, which is the >>default). >> >>-- Hal >> >> >>>Devesh >>> >>>On 28 Feb 2006 07:19:35 -0500, Hal Rosenstock >>>wrote: >>> Hi Devesh, >>> >>> On Tue, 2006-02-28 at 01:44, Devesh Sharma wrote: >>> > Hi list, >>> > Could anybody please brief me on the LID assignment policy used by the opensm >>> > subnet manager? Can a user specify fixed LID mappings using a file? >>> >>> There is a file it creates with these in it so they can be reused >>> subsequently. It is /var/cache/osm/guid2lid. >>> >>> opensm -h has the following option: >>> -c >>> --cache-options >>> Cache the given command line options into the file >>> /var/cache/osm/opensm.opts for use next invocation >>> The cache directory can be changed by the environment >>> variable OSM_CACHE_DIR >>> >>> Is that suitable for your needs?
>>> >>> -- Hal >>> > >>> > >>> >>______________________________________________________________________ >> >>> > >>> > _______________________________________________ >>> > openib-general mailing list >>> > openib-general at openib.org >>> > http://openib.org/mailman/listinfo/openib-general >>> > >>> > To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >>> >> >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Tue Feb 28 23:55:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Mar 2006 09:55:19 +0200 Subject: [openib-general] Re: [PATCH] mad: add GID/class checking for matching received to sent MADs In-Reply-To: <4404DAF9.7070807@ichips.intel.com> References: <200602281924.33041.jackm@mellanox.co.il> <4404B738.4030202@ichips.intel.com> <20060228220225.GH21824@mellanox.co.il> <4404CE24.3010907@ichips.intel.com> <20060228223909.GL21824@mellanox.co.il> <4404DAF9.7070807@ichips.intel.com> Message-ID: <20060301075519.GA22655@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] mad: add GID/class checking for matching received to sent MADs > > Michael S. Tsirkin wrote: > >>If the driver is caching the memory anyway, couldn't it just store the > >>data in the ah, rather than some private structure? > > > >No, because its keeping it in hardware format. > > Is there a particular reason it needs it cached in this format? Yes, hardware reads the address handle from memory when processing UD work requests. -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies
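
Editorial note on the "LID assignment policy of opensm" thread above: the two legality rules Eitan lists for a hand-written guid2lid file (no LID conflicts, range start aligned with 2^(LMC)) can be sketched as a small C check. This is only an illustration under stated assumptions, not OpenSM's actual parser or validation code. The names guid2lid_entry, lid_start_is_aligned, lid_ranges_conflict and guid2lid_is_legal are hypothetical, the min_lid <= max_lid test is an extra sanity check not mentioned in the thread, and the end of each range is deliberately left unchecked against 2^(LMC) because of Eitan's NOTE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of the guid2lid file as quoted in the thread:
 * GUID, min LID, max LID. (Hypothetical struct, for illustration.) */
struct guid2lid_entry {
    uint64_t guid;
    uint16_t min_lid;
    uint16_t max_lid;
};

/* Rule 2 from the thread: the start of the LID range is aligned
 * with 2^LMC, i.e. its low LMC bits are all zero. */
static int lid_start_is_aligned(uint16_t min_lid, unsigned int lmc)
{
    assert(lmc <= 7); /* LMC is a 3-bit field in PortInfo */
    return (min_lid & ((1u << lmc) - 1u)) == 0;
}

/* Rule 1 from the thread: two inclusive LID ranges conflict
 * if they share at least one LID. */
static int lid_ranges_conflict(const struct guid2lid_entry *a,
                               const struct guid2lid_entry *b)
{
    return a->min_lid <= b->max_lid && b->min_lid <= a->max_lid;
}

/* Returns 1 if every entry passes both rules, 0 otherwise.
 * The min_lid <= max_lid test is an added assumption (sanity check),
 * not something stated in the thread. */
static int guid2lid_is_legal(const struct guid2lid_entry *e, size_t n,
                             unsigned int lmc)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (e[i].min_lid > e[i].max_lid)
            return 0;
        if (!lid_start_is_aligned(e[i].min_lid, lmc))
            return 0;
        for (j = i + 1; j < n; j++)
            if (lid_ranges_conflict(&e[i], &e[j]))
                return 0;
    }
    return 1;
}
```

With the two entries Hal quoted (LIDs 0x0007 and 0x0003) the check passes for LMC 0, since every LID is aligned with 2^0. The same entries fail for LMC 1, because both start LIDs are odd. The pairwise conflict loop is quadratic, which seems acceptable for a sketch; a real tool would more likely sort the entries by min LID first.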